Compared to planar (i.e., two-dimensional) NAND flash memory, 3D NAND flash memory uses a new flash
cell design, and vertically stacks dozens of silicon layers in a single chip. This allows 3D NAND flash memory 
to increase storage density using a much less aggressive manufacturing process technology than planar NAND 
flash memory. The circuit-level and structural changes in 3D NAND flash memory significantly alter how 
different error sources affect the reliability of the memory. 

In this paper, through experimental characterization of real, state-of-the-art 3D NAND flash memory chips, 
we find that 3D NAND flash memory exhibits three new error sources that were not previously observed in 
planar NAND flash memory: (1) layer-to-layer process variation, a new phenomenon specific to the 3D nature 
of the device, where the average error rate of each 3D-stacked layer in a chip is significantly different; (2) early 
retention loss, a new phenomenon where the number of errors due to charge leakage increases quickly within 
several hours after programming; and (3) retention interference, a new phenomenon where the rate at which 
charge leaks from a flash cell is dependent on the data value stored in the neighboring cell. 

Based on our experimental results, we develop new analytical models of layer-to-layer process variation 
and retention loss in 3D NAND flash memory. Motivated by our new findings and models, we develop four 
new techniques to mitigate process variation and early retention loss in 3D NAND flash memory. Our first 
technique, Layer Variation Aware Reading (LaVAR), reduces the effect of layer-to-layer process variation by 
fine-tuning the read reference voltage separately for each layer. Our second technique, Layer-Interleaved 
Redundant Array of Independent Disks (LI-RAID), uses information about layer-to-layer process variation to 
intelligently group pages under the RAID error recovery technique in a manner that reduces the likelihood 
that the recovery of a group fails significantly earlier than the recovery of other groups. Our third technique, 
Retention Model Aware Reading (ReMAR), reduces retention errors in 3D NAND flash memory by tracking 
the retention time of the data using our new retention model and adapting the read reference voltage to data 
age. Our fourth technique, Retention Interference Aware Neighbor-Cell Assisted Correction (ReNAC), adapts 
the read reference voltage to the amount of retention interference a page has experienced, in order to re-read 
the data after a read operation fails. These four techniques are complementary, and can be combined together 
to significantly improve flash memory reliability. Compared to a state-of-the-art baseline, our techniques, 
when combined, improve flash memory lifetime by 1.85x. Alternatively, if a NAND flash vendor wants to 
keep the lifetime of the 3D NAND flash memory device constant, our techniques reduce the storage overhead 
required to hold error correction information by 78.9%. 
1 INTRODUCTION


Solid-state drives (SSDs), which consist of NAND flash memory chips, are a popular data storage 
medium in modern computer systems. Traditionally, NAND flash memory has employed a planar 
(i.e., two-dimensional) architecture, where the entire chip resides on a single layer of silicon. In
planar NAND flash memory, a flash cell is made using a floating-gate transistor, where data is 
represented by the amount of charge stored in the transistor’s floating gate. The amount of charge 
stored in the floating gate determines the threshold voltage of the flash cell transistor (i.e., the 
voltage at which the transistor turns on). 

For planar NAND flash memory, to continually increase the SSD capacity and decrease the cost- 
per-bit of the SSD, flash vendors have been aggressively scaling NAND flash memory to smaller 
manufacturing process technology nodes. This, however, comes at the cost of lower reliability [9, 13, 
69]. Due to a combination of manufacturing process technology limitations and reduced reliability 
of planar NAND flash memory, it has become increasingly difficult for vendors to continue to scale 
the density of planar NAND flash memory chips [11, 31, 80]. 

To overcome this scaling challenge, 3D NAND flash memory has recently been introduced [39, 
45, 80]. Although 3D NAND flash memory is already being deployed at large scale in new computer 
systems, there is a lack of available knowledge on the error characteristics of real 3D NAND flash 
memory chips, which makes it harder to estimate the reliability characteristics of systems that 
employ such chips. Previous publicly-available experimental studies on NAND flash memory errors 
using real flash memory chips (e.g., [4-9, 11, 13-16, 64, 69, 81]) have mostly been on planar NAND 
flash memory devices.¹

We identify that 3D NAND flash memory has three fundamental differences from the most recent 
generation (i.e., 10-15 nm) of planar NAND flash memory, which lead to new error characteristics 
for 3D NAND flash memory that we observe experimentally: (1) 3D NAND flash memory currently 
uses a different flash cell architecture than planar NAND flash memory. Instead of using a floating- 
gate transistor, a cell in 3D NAND flash memory consists of a charge trap transistor [86], which 
stores charge within an insulator. (2) Unlike planar NAND flash memory, 3D NAND flash memory 
vertically stacks multiple layers of silicon together within a single chip. Modern 3D NAND flash 
memory chips typically contain 24-96 stack layers [1, 39, 45, 50, 80, 90]. Due to the high layer count, 
3D NAND flash memory can provide high storage density without needing to scale the process 
technology as aggressively as was done for planar NAND flash memory. (3) While modern planar 
NAND flash memory uses a manufacturing process technology node as small as 10-15 nm [58, 90], 
3D NAND flash memory currently uses a much larger manufacturing process technology node 
(e.g., 30-50 nm [86]). 

Our goal in this work is to (1) identify and understand the new error characteristics of 3D 
NAND flash memory (i.e., those that did not exist previously in planar NAND flash memory), 
and (2) develop new techniques to mitigate prevailing 3D NAND flash memory errors. We aim to 
achieve these goals via rigorous experimental characterization of real, state-of-the-art 3D NAND 
flash memory chips from a major flash vendor. Based on our comprehensive characterization and 
analysis, we identify three new error characteristics that were not previously observed in planar 
NAND flash memory, but are fundamental to the new architecture of 3D NAND flash memory: 


(1) 3D NAND flash memory exhibits layer-to-layer process variation, a new phenomenon specific
to the 3D nature of the device, where the average error rate differs significantly across the
3D-stacked layers of a chip (Section 4.2). We are the first to provide detailed
experimental characterization results of layer-to-layer process variation in real flash devices in 
open literature. Our results show that the raw bit error rate in the middle layer can be 6x the 
raw bit error rate in the top layer. 


¹With the exception of our very recent prior work [65], which examined two specific important aspects of 3D NAND flash
memory reliability: temperature and self-recovery effects.

(2) 3D NAND flash memory experiences early retention loss, a new phenomenon where the number
of errors due to charge leakage increases quickly within several hours after programming, but 
then increases at a much slower rate (Section 4.3). We are the first to perform an extended- 
duration observation of early retention loss. While a prior study [23] examines the impact of 
early retention loss over only the first 5 minutes after data is written, we examine the impact of 
early retention loss over the course of 24 days. Our results show that the retention error rate in 
a 3D NAND flash memory block quickly increases by an order of magnitude within ~3 hours 
after programming. 

(3) 3D NAND flash memory experiences retention interference, a new phenomenon where the 
rate at which charge leaks from a flash cell is dependent on the amount of charge stored in 
neighboring flash cells (Section 4.4). Our results show that charge leaks at a lower rate (i.e., the 
retention loss speed is slower) when the vertically-adjacent cell is in a state that holds more 
charge (i.e., a higher-voltage state). 


Our experimental observations indicate that we must revisit the error models and the error 
mitigation mechanisms devised for planar NAND flash memory, as they are no longer accurate 
for 3D NAND flash memory behavior. To this end, we develop new analytical models of (1) the 
layer-to-layer process variation in 3D NAND flash memory (Section 5.1), and (2) retention loss in 
3D NAND flash memory (Section 5.2). Our models estimate the raw bit error rate (RBER), threshold 
voltage distribution, and the optimal read reference voltage (i.e., the voltage at which the RBER is 
minimized when applied during a read operation) for each flash page. Both models are useful for 
developing techniques to mitigate raw bit errors in 3D NAND flash memory. 

We propose four new techniques to mitigate the unique layer-to-layer process variation and 
early retention loss errors observed in 3D NAND flash memory. Each technique makes use of our 
new analytical models of layer-to-layer process variation and retention loss in 3D NAND flash 
memory. Our first technique, Layer Variation Aware Reading (LaVAR), reduces the effect of process variation by
fine-tuning the read reference voltage independently for each layer. Our second technique, Layer- 
Interleaved Redundant Array of Independent Disks (LI-RAID), improves reliability by changing 
how pages are grouped under the RAID error recovery technique. LI-RAID uses information about 
layer-to-layer process variation to reduce the likelihood that the RAID recovery of a group could 
fail significantly earlier during the flash lifetime than the recovery of other groups. Our third 
technique, Retention Model Aware Reading (ReMAR), reduces retention errors in 3D NAND flash 
memory by tracking the retention time of the data using our retention model and adapting the 
read reference voltage to data age. Our fourth technique, Retention Interference Aware Neighbor- 
Cell Assisted Correction (ReNAC), adapts the read reference voltage to the amount of retention 
interference and re-reads the data after a read operation fails, in order to correct the cells affected 
by retention interference. These four techniques are complementary, and can be combined together 
to significantly improve flash memory reliability. Compared to a state-of-the-art baseline, our 
techniques, when combined, improve flash memory lifetime by 1.85x. Alternatively, if a NAND 
flash vendor wants to keep the lifetime of the 3D NAND flash memory device constant, our 
techniques reduce the storage overhead required to hold error correction information by 78.9%. 

This paper makes the following key contributions: 

• It presents the first comprehensive experimental characterization of real, state-of-the-art 3D NAND
flash memory chips, and provides an in-depth analysis of layer-to-layer process variation, early
retention loss, and retention interference, which are three new error characteristics inherent to
3D NAND flash memory.

• It develops new analytical models for (1) layer-to-layer process variation and (2) early retention
loss, which can be used to estimate the raw bit error rate, mean and standard deviation of the
threshold voltage distribution of each state, and the optimal read reference voltages.

• It develops four new mechanisms, LaVAR, LI-RAID, ReMAR, and ReNAC, to mitigate the three new
error characteristics we have identified in 3D NAND flash memory. It evaluates these techniques, 
and shows that, when applied together, they improve 3D NAND flash memory lifetime by 1.85x, 
or reduce the storage overhead for error correction by 78.9% if we keep the lifetime constant, 
compared to a state-of-the-art baseline. 


2 BACKGROUND 


In this section, we first provide necessary background on the basics of NAND flash memory 
(Section 2.1). Next, we briefly discuss the different known sources of errors within planar NAND 
flash memory (Section 2.2). For an extended background on NAND flash memory, we refer the 
reader to our prior works [9-11]. 


2.1 NAND Flash Memory Basics 


In NAND flash memory, each flash cell consists of a transistor that can store charge. A flash 
cell represents a certain data value based on the threshold voltage (Vth) of its transistor, which
is determined by the amount of charge stored in it. In multi-level cell (MLC) flash memory, each 
cell stores two bits of data. A threshold voltage window (i.e., state) is assigned for each possible 
two-bit value. Figure 1a shows the four possible states (i.e., ER, P1, P2, P3) in MLC NAND flash 
memory, along with their corresponding bit values. As a result of manufacturing process variation, 
the threshold voltages of cells programmed to the same state follow a Gaussian-like distribution
across the voltage window of the state [9, 14, 64, 81], depicted as a probability density curve in 
Figure 1a. 
Fig. 1. (a) Threshold voltage distribution and read reference voltages for MLC NAND flash memory; (b) Internal
organization of a flash block. 


A NAND flash memory chip contains thousands of flash blocks, which are two-dimensional 
arrays of flash cells. Figure 1b shows the internal organization of a flash block. Each block contains 
dozens of rows (i.e., wordlines) of flash cells, where each row typically contains 64K to 128K cells. 
All of the cells on the same wordline are read and programmed together as a group. MLC NAND 
flash memory partitions the two bits of each flash cell in a wordline across two pages, which are 
the unit of data programmed at a time (typically 8 kB). The least significant bits (LSBs) of all cells in 
one wordline form the LSB page of that wordline, and the most significant bits (MSBs) of these cells
form the MSB page. The sources and drains of cells across different wordlines in the same block are
connected in series to form a bitline. 
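
As a minimal sketch of the page organization described above, the following Python fragment splits the two-bit values held by the cells of one wordline into an LSB page and an MSB page. The function name and the representation of a cell as an integer in the range 0 to 3 are assumptions made for illustration; the actual mapping between threshold voltage states and bit values is the one shown in Figure 1a and is not modeled here.

```python
from typing import List, Tuple


def split_wordline(cells: List[int]) -> Tuple[List[int], List[int]]:
    """Split two-bit MLC cell values into an LSB page and an MSB page.

    Each entry of `cells` is assumed to be the two-bit value (0..3) held
    by one cell on the wordline; this encoding is illustrative only.
    """
    for value in cells:
        if not 0 <= value <= 3:
            raise ValueError("an MLC cell holds a two-bit value (0..3)")
    lsb_page = [value & 1 for value in cells]         # least significant bits
    msb_page = [(value >> 1) & 1 for value in cells]  # most significant bits
    return lsb_page, msb_page


# A toy wordline with four cells (a real wordline has 64K to 128K cells).
wordline = [0b11, 0b10, 0b00, 0b01]
lsb, msb = split_wordline(wordline)
```

In this toy example, the two resulting lists together hold exactly the bits of the four cells, which mirrors how one wordline backs two separately addressable pages.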

Reads and writes to the flash memory are managed by an SSD controller. The controller reads a 
page from a flash block by applying a read reference voltage (Vref) to the wordline that holds the
page. A cell switches on only if Vref > Vth. Figure 1a shows the three read reference voltages (Va,
Vb, and Vc) that are used to distinguish between each state. A sense amplifier is attached to each
bitline to detect if the cell is switched on. In order to detect the state of a particular cell on the 
bitline, the controller applies a pass-through voltage (Vpass) to the wordlines of all unread cells in 
the flash block. This turns on the unread cells, allowing the value of the cell that is being read to 
propagate through the bitline to the sense amplifier. To guarantee that all unread cells are on, Vpass 
is set to the maximum possible threshold voltage [5, 9]. 
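
The read procedure above can be sketched in a few lines of Python. This is a simplified illustration, not the controller's actual logic: it assumes that a cell conducts when the voltage applied to its wordline exceeds its threshold voltage, and that the states are ordered ER < Va < P1 < Vb < P2 < Vc < P3. The function names and the numeric voltages are hypothetical.

```python
def cell_turns_on(v_th: float, v_applied: float) -> bool:
    """Assumed switching rule: the cell conducts if the applied wordline
    voltage is higher than the cell's threshold voltage."""
    return v_applied > v_th


def read_state(v_th: float, v_a: float, v_b: float, v_c: float) -> str:
    """Identify the MLC state by checking which read reference voltage
    is the lowest one that turns the cell on."""
    if cell_turns_on(v_th, v_a):
        return "ER"
    if cell_turns_on(v_th, v_b):
        return "P1"
    if cell_turns_on(v_th, v_c):
        return "P2"
    return "P3"


# Hypothetical normalized read reference voltages.
V_A, V_B, V_C = 0.0, 100.0, 200.0
states = [read_state(v, V_A, V_B, V_C) for v in (-10.0, 50.0, 150.0, 250.0)]
```

A real chip senses all cells of a wordline in parallel through the sense amplifiers, and does not necessarily apply all three voltages for every page read; the sequential checks here only make the role of each reference voltage explicit.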

Before new data can be written (i.e., programmed) to a flash page, the controller must first erase the 
entire block (i.e., 512 to 1024 pages) that the page belongs to, due to wiring constraints. After erase, 
all of the cells in the erased block are reset to the ER state. To program a flash cell, the controller 
sends the data to be programmed to the flash chip, which repeatedly pulses a high programming 
voltage on a cell to increase a cell’s threshold voltage until the cell reaches its target state. This 
iterative programming approach is called incremental step pulse programming (ISPP) [3, 69, 89, 91]. 
Each pair of erase and program operations is referred to as a program/erase (P/E) cycle. 
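
The iterative nature of ISPP can be illustrated with the following sketch. It assumes, purely for illustration, that each programming pulse raises the threshold voltage of the cell by a fixed step and that the chip verifies the cell against its target after every pulse; the function name, the step size, and the pulse limit are hypothetical and do not describe a specific chip.

```python
from typing import Tuple


def ispp_program(v_th: float, v_target: float, step: float = 1.0,
                 max_pulses: int = 1000) -> Tuple[float, int]:
    """Pulse the cell until its threshold voltage reaches the target.

    Returns the final threshold voltage and the number of pulses used.
    The fixed per-pulse increase is a simplifying assumption.
    """
    pulses = 0
    while v_th < v_target:  # verify step: has the cell reached its target?
        if pulses == max_pulses:
            raise RuntimeError("cell failed to reach its target state")
        v_th += step        # one programming pulse adds charge to the cell
        pulses += 1
    return v_th, pulses


# Program an erased cell (hypothetical normalized voltage 0.0) to 10.0.
final_v_th, pulse_count = ispp_program(0.0, 10.0, step=2.5)
```

A smaller step yields a more precise final threshold voltage at the cost of more pulses, which is the basic trade-off behind step-wise programming.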


2.2 Errors in NAND Flash Memory 


As vendors work to increase the density of NAND flash memory, they use aggressive manufacturing 
process technology scaling to reduce the size of a flash cell. As a result, each cell has a smaller 
capacity to store charge, and the cells move closer to each other. These changes reduce the reliability 
of the NAND flash memory, thereby increasing the probability of flash memory errors in newer 
generations of planar (i.e., two-dimensional) NAND flash memory. Errors occur when the cell 
threshold voltage (Vth) unintentionally changes or is read incorrectly, which can alter the cell state
observed by the controller. Errors can be induced by a range of sources [4-9, 11, 13-16, 65, 69], 
which we divide into four categories: process variation errors, retention errors, write-induced 
errors, and read-induced errors. We briefly describe each error source below, and refer the reader 
to the prior work cited below for detailed explanations of each error source. A comprehensive 
treatment of different types of NAND flash memory errors and mitigation mechanisms for them 
can be found in our recent survey papers [9, 11]. 

Process variation errors occur as a result of the fabrication process. Within a single chip, different 
flash cells have different attributes, due to the lithography limitations of modern manufacturing 
process technologies [13, 84]. As a result, there is inherent variation among the cells, and some 
cells have a higher error rate than other cells. 

Retention errors [6-8] are errors that accumulate over time after a flash cell
is programmed. A retention error occurs because charge leaks out of the transistor over time. As
charge leaks from a cell, the cell’s threshold voltage (Vth) decreases. In planar NAND flash memory,
retention errors are the dominant source of all flash memory errors [6-8, 13], if aggressive refresh 
techniques [7, 8, 63] are not employed. 

Write-induced errors occur during program or erase operations. P/E cycling errors (or program/erase
variation errors) [14, 64, 81] are errors that occur immediately after erasing and programming
a flash page. These errors occur because of the inaccuracy of each program and erase
operation. This inaccuracy causes some cells to be programmed into a state other than their desired
target state. As more P/E cycles take place over the lifetime of a flash cell, the repeated stress causes 
more electrons to become trapped within the transistor, which is known as wearout. Wearout 
increases the inaccuracy during program and erase operations, thereby increasing the number of 
P/E cycling errors. Cell-to-cell program interference errors [15, 16] are another type of write-induced
error, in which programming an adjacent cell in another wordline increases the threshold voltage of
a cell, and thereby the RBER. Since parasitic capacitance coupling exists between
cells within close proximity of each other, when a high programming voltage is applied on one 
cell, the capacitance coupling adds charge to the transistors of the adjacent cells, increasing the 
program interference errors. 

Read-induced errors occur during read operations. Read errors [24, 29, 42] are a type of read- 
induced error where two reads to a flash cell may return different data values. A read error occurs 
when the read reference voltage is close to the cell’s threshold voltage. Such an error occurs when 
random fluctuations on the bitline cause the sense amplifier to detect the wrong data. Read disturb 
errors [5, 81] are another type of read-induced error where reading a page in a flash block may 
change the values stored in (i.e., increase the RBER of) other pages in the same block. This type of
error occurs due to the application of the pass-through voltage (Vpass) to unread cells. When one cell 
on a bitline is being read, applying Vpass to the unread cells can induce a weak programming effect 
on the unread cells, slowly transferring electrons into the unread cells’ transistors and increasing 
the threshold voltage of the unread cells. 

To mitigate these errors, SSDs use error-correcting codes (ECC) on the data. ECC has a fixed 
error correction capability: it can correct only a limited number of errors, beyond which the data is 
no longer correctable. When a flash page is uncorrectable, we say that the SSD has reached the end 
of its lifetime. 


3 ARCHITECTURAL DIFFERENCES BETWEEN 3D NAND AND PLANAR NAND 


3D NAND flash memory (or 3D NAND) has three fundamental differences from the most recent
generation (i.e., 10-15 nm) of planar NAND flash memory: (1) the flash cell design, (2) the organization
of flash cells within a chip, and (3) the manufacturing process technology node. 

Flash Cell Design. In both planar and 3D NAND flash memory, each flash cell consists of a 
transistor that can store charge, where the amount of charge determines the threshold voltage of 
the cell (i.e., the voltage at which the cell turns on). The vast majority of planar NAND flash memory 
uses a floating-gate transistor (FG) for each cell. Figure 2a illustrates the design of a floating-gate 
cell. A control gate sits at the top of the transistor. Read, program, and erase operations all apply 
a voltage onto the control gate to turn on the cell or to add charge to the transistor. A floating 
gate sits in the middle of the transistor. The floating gate is a conductor that stores the transistor’s 
charge, and is sandwiched by oxide layers. The oxide layers minimize the amount of charge that 
leaks out of the floating gate. At the bottom of the cell is the substrate, which has two terminals on 
either end, marked source (S) and drain (D). When the voltage applied on the control gate is higher 
than the voltage of the charge stored in the floating gate, an electrical channel forms between the 
source and drain, connecting them together. The floating gate voltage can be increased or decreased 
by applying a large positive or negative voltage, respectively, to the control gate, which induces 
Fowler-Nordheim tunneling [27] of electrons through the oxide. 

Instead of floating-gate transistors, most existing 3D NAND flash memory designs use a charge 
trap transistor (CT) for each cell. Figure 2b illustrates the design of a charge trap cell. The substrate, 
and therefore the channel between source and drain, sits vertically in the center of the cell. A 
charge trap layer wraps around the substrate. The charge trap layer takes the place of the floating 
gate, storing the transistor’s charge. However, unlike the floating gate, the charge trap layer is an 
insulator. The control gate still exists in a charge trap cell, but it now wraps around the charge trap 
layer. 

Fig. 2. The design of (a) a floating-gate cell, and (b) a 3D charge trap cell.

Flash Chip Organization. Figure 3 illustrates the physical organization of flash cells in 3D
NAND flash memory. The charge trap transistor design allows the bitline (BL in Figure 3) of a
block to stand vertically (i.e., along the z-axis) in the chip. In other words, the bitline now connects
together one charge trap cell from each layer of the chip, as the cells are stacked on top of each 
other. Note that all of the cells along the z-axis share the same charge trap insulator, akin to how 
transistors are connected together on a bitline in planar NAND flash memory. The control gates of 
cells in the same layer, along the y-axis, are connected together to form a wordline. In this figure, 
we show a simple example where the cells in the same y-z plane form a flash block. In reality, to 
form larger flash blocks, multiple stacks of flash cells are connected together to form longer bitlines, 
thus increasing the number of wordlines within a block. Multiple such flash blocks are aligned 
along the x-axis to form a flash chip. 


Fig. 3. 3D NAND flash memory organization. (Labels in the figure: Block K; Wordline 0, Wordline 1, and Wordline M; Layer 0; substrate; charge trap; control gate.)


Manufacturing Process Technology. Compared with the most recent generation of planar 
NAND flash memory (i.e., 10-15 nm), 3D NAND flash memory uses a much larger manufacturing 
process technology node (e.g., 30-50 nm) [86]. Because 3D NAND flash memory has a large number 
of layers (typically 24-96 [1, 39, 45, 50, 80, 90]), it can reach the same storage density as the most
recent planar NAND flash memory generation while using much larger flash cells.
4 CHARACTERIZATION OF 3D NAND FLASH MEMORY ERRORS


Our goal is to identify and understand new error characteristics in 3D NAND flash memory, 
through rigorous experimental characterization of real, state-of-the-art 3D NAND flash memory 
chips. We use the observations and analyses obtained from such characterization to (1) compare 
how the reliability of a 3D NAND flash memory chip differs from that of a planar NAND flash 
memory chip, (2) develop a model of how each new error source affects the error rate of 3D NAND 
flash memory, (3) understand if and how these reliability characteristics will change with future 
generations of 3D NAND flash memory, and (4) develop mechanisms that can mitigate new error 
sources in 3D NAND flash memory. 

For our characterization, we use the methodology discussed in Section 4.1. First, we perform a 
detailed characterization and analysis of three error characteristics that are drastically different 
in 3D NAND flash memory than in planar NAND flash memory: layer-to-layer process variation 
(Section 4.2), early retention loss (Section 4.3), and retention interference (Section 4.4). In addition to 
identifying new error sources in 3D NAND flash memory, we use our methodology to corroborate 
and quantify 3D NAND error characteristics that are a result of error sources that were previously 
identified in planar NAND flash memory, including retention loss [6-9, 11, 23, 80], P/E cycling [9,
11, 14, 64, 80, 81], program interference [4, 9, 11, 15, 16, 80], read disturb [5, 9, 11, 81], and process 
variation [13, 84]. We summarize our findings for these error types in Section 4.5, and provide 
detailed results on our characterization of these previously-identified error sources in Appendix A. 


4.1 Methodology 


We experimentally characterize several real, state-of-the-art 3D MLC NAND flash memory chips 
from a single vendor.²,³ We use a NAND flash characterization platform similar to prior work [4-9,
11-16, 64, 65, 81], which allows us to issue read-retry commands directly to the flash chip. The
read-retry command [9, 14] allows us to fine-tune the read reference voltage used for each read 
operation. The smallest amount by which we can change the read reference voltage is called a 
voltage step. We conduct all experiments at room temperature (20 °C). 

We use two metrics to evaluate 3D NAND flash memory reliability. First, we show the raw bit 
error rate (RBER), which is the rate at which errors occur in the data before error correction. We 
show the RBER when we read data using the optimal read reference voltage (Vopt), which is the
read reference voltage that generates the fewest errors in the data.⁴
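
To make the two definitions above concrete, the following sketch computes the RBER for a given read reference voltage and finds the optimal read reference voltage by an exhaustive sweep over candidate values. It is a simplified illustration that considers only one boundary between two neighboring states; the function names, the data layout, and the normalized voltages are hypothetical.

```python
from typing import Iterable, Sequence


def rber(v_ths: Sequence[float], stored_high: Sequence[bool],
         v_ref: float) -> float:
    """Fraction of cells whose read value differs from the stored value.

    Assumed two-state read: a cell is read as the higher-voltage state
    if its threshold voltage is at or above the read reference voltage.
    """
    errors = sum((v >= v_ref) != high for v, high in zip(v_ths, stored_high))
    return errors / len(v_ths)


def optimal_vref(v_ths: Sequence[float], stored_high: Sequence[bool],
                 candidates: Iterable[float]) -> float:
    """Return the candidate voltage that yields the fewest raw bit errors
    (the first such candidate if several are tied)."""
    return min(candidates, key=lambda v_ref: rber(v_ths, stored_high, v_ref))


# Toy data: four cells in the lower state, three cells in the higher state.
v_ths = [10, 12, 14, 16, 20, 22, 24]
stored_high = [False] * 4 + [True] * 3
v_opt = optimal_vref(v_ths, stored_high, range(5, 30))
```

In this toy data set the two states do not overlap, so the sweep finds a voltage with zero errors; once the distributions overlap, the minimum RBER is nonzero and the location of the optimum shifts with the distributions.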

Second, we show how the various error sources change the threshold voltage distribution. These 
changes (i.e., shifting and widening) in threshold voltage distribution directly lead to raw bit errors 
in the flash memory. To obtain the distribution, we first use the read-retry command to sweep over 
all possible voltage values, to identify the threshold voltage of each cell.⁵ Then, we use this data to
calculate the probability density of each state at every possible threshold voltage value. As part of 
our analysis, we fit the threshold voltage distribution of each state to a Gaussian distribution. We 
use the mean of the Gaussian model to represent how the distribution shifts as a result of errors, and 
we use the standard deviation of the model to represent how the distribution widens. Throughout 
this paper, we present normalized voltage values, as the actual voltage values are proprietary to 
NAND flash memory vendors. A normalized voltage of 1 represents a single fixed voltage step. 
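
The use of the fitted mean and standard deviation can be sketched as follows. The fitting procedure is not detailed in this section, so this example assumes the simplest option, taking the sample mean and the population standard deviation of the measured threshold voltages of one state as the Gaussian parameters. The function names and the voltage values are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def fit_gaussian(v_ths: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard deviation) of the threshold voltages of the
    cells in one state, used here as the Gaussian model parameters."""
    return statistics.fmean(v_ths), statistics.pstdev(v_ths)


def shift_and_widening(before: Sequence[float],
                       after: Sequence[float]) -> Tuple[float, float]:
    """Quantify how a state's distribution moved: the change in the mean
    (shift) and the change in the standard deviation (widening)."""
    mean_before, std_before = fit_gaussian(before)
    mean_after, std_after = fit_gaussian(after)
    return mean_after - mean_before, std_after - std_before


# Hypothetical normalized threshold voltages of one state, before and
# after the data is exposed to some error source.
fresh = [98, 100, 102]
aged = [94, 97, 100]
shift, widening = shift_and_widening(fresh, aged)
```

A negative shift combined with a positive widening, as in this toy example, corresponds to a distribution that moves toward lower voltages and spreads out.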


²The trends we observe from the characterization are expected to be similar for 3D charge trap flash memory manufactured
by different vendors, as their 3D flash memory organizations are similar in design.

³We normalize the actual number of stacked layers of the chips and leave out the exact process technology to protect the
anonymity of the flash vendor and to avoid revealing proprietary information.

⁴We show RBER at the optimal read reference voltage to accurately represent the reliability of NAND flash memory, as SSD
controllers tune the read reference voltage to a near-optimal point to extend the NAND flash lifetime [6, 9, 64, 76].

⁵We refer to prior work for more detail on the methodology to obtain the threshold voltage distribution [14, 64, 81].
We show two examples in Figure 4 to visualize how well this simple Gaussian model captures the
change in the measured threshold voltage distribution. Figure 4 shows the measured and modeled 
distributions under two conditions: (1) after 0 P/E cycles, 0-day retention time [6], and 0 read 
disturbs (i.e., the data contains few errors); and (2) after 10K P/E cycles, 3-day retention time [6], 
and 900K read disturbs (i.e., the data contains a high number of errors). Dotted points plot the 
measured threshold voltage distributions from the real 3D NAND memory chips. Note that we 
are unable to show the ER state distribution when the P/E cycle count is low (i.e., the black dots), 
because the erase operation cleanly resets the threshold voltage to a negative value that is lower 
than the observable voltage range under a low P/E cycle count. We use a solid line to show a fitted 
Gaussian distribution for each state. The Kullback-Leibler divergence error values [64, 81] of the 
fitted Gaussian distributions are 0.034 and 0.23.⁶ We observe, from this figure, that after the chip is
used, the threshold voltage distribution shifts due to P/E cycling, retention loss, and read disturb, 
reducing the error margins between neighboring states, and leading to more raw bit errors in the 
data. Thus, depicting and understanding how threshold voltage distributions are affected by various 
factors helps us understand how raw bit errors occur and thus devise mechanisms to mitigate 
various errors more effectively. 
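
As an illustration of the Kullback-Leibler divergence error quoted above, the following sketch compares a measured distribution with a Gaussian model over a set of discrete voltage steps and reports the result in nats. The exact formulation used in prior work [64, 81] is not reproduced in this section, so the direction of the divergence (measured relative to model), the binning, and all names and values below are assumptions.

```python
import math
from typing import List, Sequence


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_bins(centers: Sequence[float], mean: float,
                  std: float) -> List[float]:
    """Gaussian model evaluated at each voltage step, then normalized."""
    return normalize([math.exp(-0.5 * ((x - mean) / std) ** 2)
                      for x in centers])


def kl_divergence_nats(measured: Sequence[float],
                       model: Sequence[float]) -> float:
    """D(measured || model) in nats; bins with zero measured probability
    contribute nothing. The model must be nonzero wherever measured is."""
    return sum(p * math.log(p / q)
               for p, q in zip(measured, model) if p > 0)


# Hypothetical measured cell counts at five neighboring voltage steps.
centers = [98, 99, 100, 101, 102]
measured = normalize([1, 4, 6, 4, 1])
model = gaussian_bins(centers, mean=100.0, std=1.0)
kl_error = kl_divergence_nats(measured, model)
```

A value close to zero indicates that the Gaussian model reproduces the measured distribution closely, while larger values indicate a poorer fit.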


Fig. 4. 3D NAND threshold voltage distribution before (black) and after (red) the data is subject to a high
number of errors (due to P/E cycling, retention loss, and read disturb). (Plot: probability density versus normalized Vth, 0 to 300. Black: 0 P/E cycles, 0-day retention, 0 reads. Red: 10K P/E cycles, 3-day retention, 900K reads.)


In the following sections, we directly show the mean and the standard deviation of the fitted 
threshold voltage distributions instead of the distribution itself, to simplify the presentation of our 
results. 

Limitations. In our experiments, we randomly sampled 27 flash blocks throughout our
characterizations. Note that each sampled flash block consists of tens of millions of flash cells. Thus, we
believe that our observations are representative of the general behavior that takes place in the 
model of 3D NAND chips that we tested. While adding more data samples (i.e., flash blocks to 
test) can add to the statistical strength of our results, we do not believe that this would change 
the general qualitative findings that we make and the models that we develop in this work. This is 
because the new error characteristics we observe are caused by the underlying architecture of 3D 
NAND flash memory (see Section 3). 


⁶ A KL-divergence error of x means that the model loses x natural units of information (i.e., nats) due to modeling error.

Note that we do not characterize chip-to-chip process variation, as an accurate study of such
variation requires a large-scale study of a large number (e.g., hundreds) of 3D NAND flash memory 
chips, which we do not have access to. Hence, we leave such a large-scale study for future work. 


4.2 Layer-to-Layer Process Variation 


Process variation refers to the variation in the attributes of flash cells when they are fabricated 
(see Section 2.2). Due to process variation, some flash cells can have a higher RBER than others, 
making these cells the limiting factor of overall flash memory reliability. In 3D NAND flash memory, 
process variation can occur along all three axes of the memory (see Figure 3). Among the three axes, 
we expect the variation along the z-axis (i.e., layer-to-layer variation) to be the most significant, 
due to the new challenge of stacking multiple flash cells across layers. Prior work has shown that 
current circuit etching technologies are unable to produce identical 3D NAND cells when punching 
through multiple stacked layers, leading to significant variation in the error characteristics of flash 
cells that reside in different layers [38, 92]. 

To characterize layer-to-layer process variation errors within a flash block, we first wear out 
the block by programming random data to each page in the block until the block endures 10K P/E 
cycles. Then, we compare the collective characteristics of the flash cells in one layer with those 
in another layer. We repeat this experiment for flash blocks on multiple chips to verify all of our 
findings. 

Observations. Figure 5 shows the RBER variation along the z-axis (i.e., across layers) for a flash 
block that has endured 10K P/E cycles. The chips we use for characterization have between 30 and 
40 layers. We normalize the number of layers from 0 (the top-most layer) to 100 (the bottom-most 
layer) by multiplying the actual layer number with a constant, to maintain the anonymity of the 
chip vendors. Figure 5a breaks down the errors according to the originally-programmed state and 
the current state of each cell; Figure 5b breaks down the errors into MSB and LSB page errors. In 
Figure 5b, the solid curve and the dotted curve show the results for two blocks that were randomly 
selected from two different flash chips. We make five observations from Figure 5. First, ER ↔ P1
and P1 ↔ P2 errors vary significantly across layers, while P2 ↔ P3 errors remain similar across
layers. The variation in ER ↔ P1 errors is mainly caused by the large variation in mean threshold
voltage of the ER state across layers; the variation in P1 ↔ P2 errors is caused by the variation in the
threshold voltage distribution width of the P1 state across layers (Section A.4). Second, both the
MSB and LSB error rates vary significantly across layers. We call this phenomenon layer-to-layer
process variation. For example, the MSB page on normalized layer 55 in the middle (i.e., Max MSB) has
an RBER 21× that of normalized layer 0. Third, MSB error rates are much higher than LSB error
rates in a majority of the layers, on average by 2.4×. We call this phenomenon MSB-LSB RBER
variation. MSB error rates are usually higher than LSB error rates because reading an MSB page
requires two read reference voltages (Va and Vc), whereas reading an LSB page requires only one
(Vb). Fourth, the top half of the layers have lower error rates than the bottom half. This is likely
caused by the variation in the flash cell size across layers. Fifth, the RBER variation we observe 
is consistent across two randomly-selected blocks from two different chips. This indicates that 
layer-to-layer process variation and MSB-LSB RBER variation are consistent characteristics of 3D 
NAND flash memory. 
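
To make the MSB-LSB asymmetry concrete, the following Python sketch reads a two-bit cell with three read reference voltages and counts MSB and LSB page errors separately. The Gray coding of the four states (ER = 11, P1 = 01, P2 = 00, P3 = 10, written as MSB then LSB) is an assumption chosen to be consistent with the statement that an MSB read uses Va and Vc while an LSB read uses only Vb; the voltage values and function names are hypothetical.

```python
# Assumed (MSB, LSB) coding per state; an illustration, not the documented
# mapping of the tested chips.
STATE_BITS = {"ER": (1, 1), "P1": (0, 1), "P2": (0, 0), "P3": (1, 0)}


def read_msb(vth, va, vc):
    """MSB is 1 below Va (ER) or at/above Vc (P3): two comparisons are needed."""
    return 1 if (vth < va or vth >= vc) else 0


def read_lsb(vth, vb):
    """LSB is 1 below Vb (ER, P1) and 0 otherwise: one comparison is needed."""
    return 1 if vth < vb else 0


def count_page_errors(cells, va, vb, vc):
    """cells: list of (programmed_state, current_vth).

    Returns (msb_errors, lsb_errors) for the given read reference voltages.
    """
    msb_errors = 0
    lsb_errors = 0
    for state, vth in cells:
        msb, lsb = STATE_BITS[state]
        msb_errors += int(read_msb(vth, va, vc) != msb)
        lsb_errors += int(read_lsb(vth, vb) != lsb)
    return msb_errors, lsb_errors


# Hypothetical cells: a P1 cell that drifted below Va, a P2 cell that drifted
# below Vb, and a P3 cell that drifted below Vc.
cells = [("ER", 20), ("P1", 55), ("P2", 125), ("P3", 195), ("P2", 160)]
msb_errors, lsb_errors = count_page_errors(cells, va=60, vb=130, vc=200)
```

Under this coding, ER ↔ P1 and P2 ↔ P3 misreads appear as MSB page errors, whereas P1 ↔ P2 misreads appear as LSB page errors, so the MSB page is exposed to two state boundaries instead of one.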

Figure 6 shows how the optimal read reference voltages vary across layers. Three subfigures show 
the optimal read reference voltages for Va, Vb, and Vc. We make two observations from Figure 6.
First, the optimal voltages for Va and Vb vary significantly across layers, but the optimal voltage for
Vc does not change by much. This is because process variation mainly affects the threshold voltage
distributions of the ER and P1 states, whereas the threshold voltage distributions of the P2 and P3
states, which are more accurately controlled by ISPP (see Section 2), are similar across layers. We
discuss this further in Appendix A.4. Second, the optimal read reference voltages for Va and Vb are
lower for cells in the top half of the layers than for cells in the bottom half. This is because process
variation significantly affects the threshold voltage of the ER and P1 states (see Appendix A.4).

Fig. 5. Variation of RBER across layers.

Fig. 6. Variation of optimal read reference voltage across layers.


Insights. We show that the phenomena of layer-to-layer process variation and MSB-LSB RBER 
variation, which are unique to 3D NAND flash memory, are significant. We refer to Appendix A.4 
for a comparison between layer-to-layer process variation and bitline-to-bitline process variation. 
In the future, as 3D NAND flash devices scale along the z-axis, more layers will be stacked vertically 
along each bitline. This will likely further exacerbate the effect of layer-to-layer process variation, 
making it even more important to study and mitigate its negative effects. 


4.3 Early Retention Loss 


Retention errors are flash memory errors that accumulate after data has been programmed to the 
flash cells [6-8] (see Section 2.2). Because 3D NAND flash memory typically uses a different cell
design (i.e., the charge trap cell described in Section 3) than planar NAND flash memory (which
uses floating-gate cells), it has drastically different retention error characteristics. The charge
trap flash cells used in 3D NAND flash memory suffer from early retention loss, i.e., fast charge
loss within a few seconds. This phenomenon has been observed by prior works using circuit-level
characterization [21, 23]. However, due to limitations of the circuit-level characterization
methodology used by these prior works, openly-available characterizations of early retention loss 
in 3D charge trap NAND flash devices document retention loss behavior for up to only 5 minutes 
after the data is written (i.e., for a maximum retention time of 5 minutes). This limited window is 
insufficient for understanding early retention loss under real workloads, which typically have much 
longer retention time requirements [63], i.e., the length of time that has elapsed since programming 
until the data is accessed again. 

Our goal is to experimentally characterize early retention loss in 3D NAND flash memory for 
a large range of retention times (e.g., from several minutes to several weeks). First, we randomly 
select 11 flash blocks within each chip and write pseudo-random data to each page within the block 
to wear the blocks out. We wear out each block to a different P/E cycle count, so that we have error 
data for every 1K P/E cycles between 0 and 10K P/E cycles.⁷ Then, we program pseudo-random data
to each flash block, and wait for up to 24 days under room temperature. To characterize retention 
loss, we measure the RBER and the threshold voltage distribution at nine different retention times, 
ranging from 7 minutes to 24 days. To minimize the impact of other errors, and to allow us to 
include very low retention times, we characterize only the first 72 flash pages within each block. 
We believe that the observations we make on these flash cells are representative of the entire chip, 
and we can generalize the observations to a majority of 3D NAND flash memory cells. We analyze 
the threshold voltage distribution in Appendix A.2. 

Observations. Figure 7 shows the comparison between the retention error rate of 3D NAND 
and planar NAND flash memory at 10,000 P/E cycles using both a logarithmic time scale on the 
x-axis (Figure 7a) and a linear time scale on the x-axis (Figure 7b) for different retention times 
after programming. To make this comparison, we perform the same experiment as above for 
planar NAND flash memory chips. Due to limitations of the available data, we extend our data 
to the same retention time range using a linear model that was proposed by prior work [65, 69]: 
log(RBER) = A · log(t) + B, where t is the retention time, and A and B are parameters of the linear
model. The dotted portions of the lines represent the RBER that is predicted by the linear model. 
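
The following Python sketch shows one way to fit this linear model to a handful of measured points and extrapolate it to a longer retention time. It is a minimal illustration: the function names are hypothetical, and the use of the natural logarithm is an assumption, since the base of the logarithm is not stated in the text.

```python
import math


def fit_loglog_line(times_s, rbers):
    """Least-squares fit of log(RBER) = A * log(t) + B; returns (A, B).

    Natural logarithms are assumed for both RBER and t.
    """
    xs = [math.log(t) for t in times_s]
    ys = [math.log(r) for r in rbers]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


def predict_rber(t_s, a, b):
    """RBER predicted by the fitted model at retention time t_s (seconds)."""
    return math.exp(a * math.log(t_s) + b)
```

Given measurements up to some retention time, `predict_rber` can then be evaluated at a later time (e.g., 24 days expressed in seconds) to extend a curve in the same way that the dotted line segments do.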

We make two observations from this figure. First, in Figure 7a, we observe that the retention 
error rate changes much more slowly for planar NAND flash memory than for 3D NAND flash 
memory. Although the 3D NAND flash memory chip has lower RBER than the planar NAND flash 
memory chip shortly after programming, the RBER becomes higher on the 3D NAND flash memory 
chip after 7 × 10^3 seconds (~2 hours) of retention time. This means that 3D NAND flash memory is
more susceptible to the retention loss phenomenon than planar NAND flash memory. Second, in
Figure 7b, we observe that the RBER of 3D NAND flash memory quickly increases by an order of
magnitude in 10^4 seconds (~3 hours), and by another order of magnitude in 10^6 seconds (~11 days).
However, we do not observe a large difference in retention loss between low and high retention 
times for planar NAND flash memory (also shown by prior works [6, 69]). This shows that the 
retention loss is steep when retention time is low, but the retention loss flattens out when the 
retention time is high. This is a result of the early retention loss phenomenon in 3D NAND flash 
memory. 


⁷ For all experiments throughout the paper, we consistently assume a 0.5-second dwell time, which is the length of time
between consecutive program/erase operations [65].

Early retention loss has two possible causes. First, the tunnel oxide layer is thinner
in 3D NAND flash memory than in planar NAND flash memory [86, 97]. Since a 3D charge trap cell 
uses an insulator to store charge, which is immune to the short circuiting caused by stress-induced 
leakage current (SILC) [26, 73], the tunnel oxide layer in 3D NAND flash memory is designed to 
be thinner to improve programming speed [80]. This causes charge to leak very fast soon after 
programming. Second, cells connected on the same bitline share the same charge trap layer. As a 
result, charge that is programmed to a flash cell quickly leaks to adjacent cells that are on the same 
bitline due to electron diffusion through the shared charge trap layer [23], which we discuss further 
in Section 4.4.

Fig. 7. Retention error rate comparison between 3D NAND and planar NAND flash memory at 10K P/E cycles.
Dotted portions of lines represent the RBER predicted by the linear model proposed by prior work [65, 69]. 
We show the retention time on the x-axis using both (a) a logarithmic time scale and (b) a linear time scale. 


Figure 8 plots how the optimal read reference voltage changes with retention time. The three
subfigures show the optimal voltages for Va, Vb, and Vc. We make three observations from this
figure. First, the relation between the optimal read reference voltages of Vb or Vc and the retention
time can be modeled as [65, 69]: V = A · log(t) + B, similar to the logarithm of RBER (which we
discuss above). Second, the optimal read reference voltages for Vb and Vc decrease significantly as
retention time increases, whereas Va remains relatively constant. Third, due to the early retention
loss phenomenon, the optimal read reference voltages for Vb and Vc change rapidly when the
retention time is low (e.g., Vc changes by 5 voltage steps within the first 3 hours), but they change
slowly when the retention time is high (e.g., Vc changes by another 5 voltage steps after 11 days).

Insights. We compare the errors caused by retention loss in 3D NAND flash memory to those
in planar NAND flash memory, using our results in Figure 7 and the results reported in prior 
work [6, 7, 69]. We find two major differences in 3D NAND flash memory, which we summarize 
below. More results and insights are in Appendix A.2. First, 3D NAND flash memory is more 
susceptible to retention errors than planar NAND flash memory, and its error rate increases much 
faster when the retention time is low than when the retention time is high. This is a result of the 
early retention loss phenomenon in 3D NAND flash memory, which is due to the use of a different 
flash cell design and thus is likely to remain in future generations of 3D NAND flash memory. 
Second, the optimal read reference voltages for Vb and Vc in 3D NAND flash memory change
significantly with retention time. However, in planar NAND flash memory, the optimal voltage
for Vb does not change by much [6], indicating that retention loss is a more pressing phenomenon
in 3D NAND flash memory. This makes adjusting the optimal read reference voltages even more
important for 3D NAND flash memory than for planar NAND flash memory. We conclude that it is
necessary to develop novel mechanisms to mitigate the early retention loss phenomenon in 3D
NAND flash memory.

Fig. 8. Optimal read reference voltages for different retention times. Note that the x-axis uses a logarithmic
time scale.


4.4 Retention Interference 


Retention interference is the phenomenon in which the speed of retention loss for a cell depends on the
threshold voltage of a vertically-adjacent neighbor cell whose charge trap layer is directly connected 
to the victim cell along the bitline. Retention interference is unique to 3D NAND flash memory, as 
cells along the same bitline in 3D NAND flash memory share the same charge trap layer. If two 
neighboring cells have different threshold voltages over time, charge can leak away from the cell 
with a higher threshold voltage to the cell with a lower threshold voltage [23]. Figure 9 shows an 
example of this phenomenon, where charge leaks from the top cell (which is in a higher-voltage 
state) to the bottom cell (which is in a lower-voltage state) through the shared charge trap layer. 
This charge leakage reduces the threshold voltage of the top cell while increasing the threshold 
voltage of the bottom cell. 


[Figure 9 diagram. Labels: vertically-adjacent cell; retention interference; victim cell.]

Fig. 9. Retention interference phenomenon: a vertically-adjacent cell leaks charge into a victim cell. 


We use the same data used for retention loss in Section 4.3 to observe the effects of retention 
interference. To eliminate any noise due to program interference, we use only the neighboring cells
that are programmed before the victim cells to establish the retention interference correlation, as
these cells do not induce program interference on the victim cells. We also ignore victim cells that 
are in the ER state, as they are significantly affected by program interference even though they are 
programmed after their neighbors [4]. Once program interference is eliminated, the cells should 
experience a similar threshold voltage shift due to retention loss except for the effects of retention 
interference. To find the retention interference, we first group all of the victim cells based on their 
threshold voltage states and the states of their neighboring cells. Then, we compare the amount by 
which the threshold voltages shift over a 24-day retention time, for each group, to observe how the 
cells are affected by the retention interference caused by neighboring cells. 
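
The grouping step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the record layout, the function name, and the convention that a shift is the decrease in threshold voltage (in voltage steps) between the start and the end of the retention period are all hypothetical.

```python
from collections import defaultdict


def mean_shift_by_state_pair(cells):
    """Average threshold voltage shift per (victim state, neighbor state) pair.

    cells: iterable of (victim_state, neighbor_state, vth_before, vth_after),
    where vth_before and vth_after are the victim cell's threshold voltage at
    the start and the end of the retention period (assumed record layout).
    Victim cells in the ER state are skipped, as in the text.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for victim, neighbor, vth_before, vth_after in cells:
        if victim == "ER":
            continue
        entry = totals[(victim, neighbor)]
        entry[0] += vth_before - vth_after  # positive value: charge was lost
        entry[1] += 1
    return {pair: total / count for pair, (total, count) in totals.items()}


# Hypothetical measurements for illustration only.
example_cells = [
    ("P3", "ER", 260, 240),
    ("P3", "ER", 262, 244),
    ("P3", "P3", 260, 250),
    ("ER", "P3", 10, 12),
]
shifts = mean_shift_by_state_pair(example_cells)
```

Comparing the entries that share a victim state but differ in neighbor state then isolates the contribution of the neighboring cell.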

Observations. Figure 10 shows the average threshold voltage shift over a 24-day retention time, 
broken down by the state of the victim cell (V) and the state of the neighboring cell (N). Each bar 
represents a different (V, N) pair. Different shades represent the different states of the neighboring 
cell, as labeled in the legend. Every 4 bars are grouped by the state of the victim cell, as labeled on 
the y-axis. The length of each bar represents the amount of threshold voltage shift over the 24-day 
retention time. From Figure 10, we observe that the threshold voltage shift over retention time is 
lower when the neighboring cell is in a higher-voltage state (e.g., the P3 state).

[Figure 10 plot. x-axis: number of voltage steps shifted over 24-day retention time (0 to 20).]


Fig. 10. Retention interference phenomenon observed at 10K P/E cycles. 


Insights. We are the first to quantify the retention interference phenomenon in 3D NAND flash 
memory. Our observation from Figure 10 shows that the amount of retention loss for a flash cell is 
correlated with its neighboring cell’s state. We expect retention interference to become stronger as 
we shrink the manufacturing process technology node in future 3D NAND flash memory devices. 
This is because the distance between neighboring cells will decrease, and fewer electrons will be 
stored within each flash cell, increasing the susceptibility of a cell to interference from neighboring 
cells. 


4.5 Other Error Characteristics 


In addition to the three new error sources we find in 3D NAND flash memory, we also characterize
the behavior of other known error sources in 3D NAND flash memory and compare them to their
behavior in planar NAND flash memory. We present a high-level summary of our findings for these
errors here, and provide detailed results and analyses for them in Appendix A:

• Unlike in planar NAND flash memory, we do not find any evidence of program errors [4, 64, 81]
in 3D NAND flash memory (Section A.1.1).

• P/E cycling error in 3D NAND flash memory follows a linear trend, which is similar to that
in planar NAND flash memory using an older manufacturing process technology node (e.g.,
20-24 nm) [14]. However, in sub-20 nm planar NAND flash memory, P/E cycling error exhibits a
power law trend [64, 81] (Appendix A.1.2).

• 3D NAND flash memory experiences 40% less program interference than 20-24 nm planar NAND
flash memory (Appendix A.1.3).

• 3D NAND flash memory experiences 96.7% weaker read disturb than 20-24 nm planar NAND
flash memory. The impact of read disturb is low enough in 3D NAND flash memory that it does
not require significant error mitigation (Appendix A.3.2).

Note that these differences are mainly due to the larger manufacturing process technology nodes
currently used in 3D NAND flash memory, and thus are not the focus of this paper. In comparison,
the new error characteristics that we focus on (layer-to-layer process variation, early retention loss,
and retention interference) are caused by the architectural and circuit-level changes introduced in
3D NAND flash memory.


4.6 Summary 


We summarize the key differences between 3D NAND and planar NAND flash memory, in terms 
of error characteristics and the expected trends for future 3D NAND flash memory devices, in 
Table 1. The first column of this table lists each attribute that we study. The second column shows 
the key difference in the observation that we find in 3D NAND flash memory versus planar NAND 
flash memory, for each attribute that we study. The third column shows the fundamental cause 
of each difference. The last column describes the expected trend of this difference in future 3D 
NAND flash memory devices. We provide the necessary characterizations and models that help us 
quantitatively understand these differences in Appendix A.1.2, A.1.3, A.2, A.3.1, A.3.2, and A.4. 


5 3D NAND FLASH MEMORY ERROR MODELS 


In the previous sections, we have established a basic understanding of the similarities and differences 
between 3D NAND and planar NAND flash memory in terms of error characteristics and reliability. 
In this section, we quantify these differences by developing analytical models of the process 
variation (Section 5.1) and retention loss (Section 5.2) phenomena in 3D NAND flash memory. 
These models are useful for at least two major purposes. First, the insights obtained from using 
these models can motivate and enable us to develop new error mitigation mechanisms for 3D 
NAND flash memory. Second, the retention model and the model parameters are also useful for 
comparing the reliability of newer or older generations of planar NAND flash memory with our 
tested 3D NAND flash memory chips. We focus on developing these models using our existing 
characterization data from real 3D NAND flash memory chips (some of which was presented in 
Section 4). In Section 6, we discuss (1) how to efficiently learn the models for each chip online within 
the SSD controller by performing the characterization and model fitting online, and (2) how to use 
the online models to develop mechanisms that improve the lifetime of 3D NAND flash memory. 


5.1 RBER Variation Model 


Since the layer-to-layer variation in 3D NAND flash memory causes variation in RBER within a 
flash block, it is no longer sufficient to use a single RBER value to represent the reliability of all 
pages in that block. Instead, we model the variation in per-page RBER within a flash block as a
gamma distribution (i.e., $\mathrm{gamma}(x, a, s) = \frac{x^{a-1} e^{-x/s}}{s^{a}\,\Gamma(a)}$). In this model, x is the RBER; a is the shape
parameter, which controls how the RBER distribution is skewed; and s is the scale parameter, which 
controls the width of the RBER distribution. 
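
As a hedged illustration, the following Python sketch evaluates the standard gamma density with shape a and scale s, and estimates the two parameters from a list of per-page RBER values. The method-of-moments estimator used here is a simple stand-in chosen for this example; the text does not state which fitting procedure was used, and the function names are hypothetical.

```python
import math


def gamma_pdf(x, a, s):
    """Gamma density x^(a-1) * exp(-x/s) / (s^a * Gamma(a)) for x > 0.

    Computed in log space to avoid overflow for large shape values.
    """
    log_pdf = (a - 1.0) * math.log(x) - x / s - a * math.log(s) - math.lgamma(a)
    return math.exp(log_pdf)


def fit_gamma_moments(values):
    """Method-of-moments estimate of (shape a, scale s).

    Uses mean = a * s and variance = a * s^2 (an assumed, simple estimator).
    """
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    shape = mean * mean / variance
    scale = variance / mean
    return shape, scale
```

A small shape value corresponds to a strongly skewed distribution with a long tail of high-RBER pages, while the product of shape and scale equals the average RBER.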

Figure 11 shows the probability density for per-page RBER within a block that has endured 10K 
P/E cycles. The bars show the measured per-page RBERs categorized into 50 bins, and the blue and 
orange curves are the fitted gamma distributions whose parameters are shown on the legend. The 
blue bars and curve represent the measured and fitted RBER distributions when the pages are read 
using the variation-agnostic Vopt. To find the variation-agnostic Vopt, we use techniques designed
for planar NAND flash memory to learn a single optimal read reference voltage (Vopt) for each flash
block, such that the chosen voltage minimizes the overall RBER across the entire block [64, 76]. The
orange bars and curve represent the measured and fitted RBER distributions when the pages are
read using the variation-aware Vopt, on a per-page basis. To find the variation-aware Vopt, we use
techniques that are described in Section 6.1 to efficiently learn an optimal read reference voltage
for each page in the block, such that we minimize the per-page RBER.

Fig. 11. RBER distribution across pages within a flash block.


We make three observations from the figure. First, the gamma distribution fits well with the 
measured probability density function of RBER variation across layers: the Kullback-Leibler divergence
error value [53] between the measured and fitted distributions is only 0.09. Second, the
average RBER reduces from 1.6 × 10^-4 to 1.4 × 10^-4 when we use the variation-aware Vopt. Third,
some flash pages have a much higher RBER than the average RBER (e.g., > 4 × 10^-4) even when we
use the variation-aware Vopt. This large gap between the worst-case RBER and the average RBER
is caused by both layer-to-layer process variation and MSB-LSB RBER variation (see Figure 5 in 
Section 4.2). The pages that have the highest RBER are MSB pages that reside in the middle layers. 
This observation indicates that there is potential to significantly improve reliability by minimizing 
the RBER variation across flash pages (for which we describe a mechanism in Section 6.2). 


5.2 Retention Loss Model 


We construct a model to describe the early retention loss phenomenon and its impact on RBER 
(log(RBER)) and threshold voltage (V) in 3D NAND flash memory, as a function of retention time 
(t) and the P/E cycle count (PEC): log(RBER) = A · log(t) + B; V = A · log(t) + B. For both equations,
A = α · PEC + β and B = γ · PEC + δ, where α, β, γ, and δ are constants that change depending
on which variable we are solving for. We use the ordinary least squares method implemented in
Statsmodels [88] to fit the model to our real characterization data described in Section 4.3. Recall
that this data is collected from 72 flash pages belonging to 11 randomly-selected flash blocks. 
Following the experimental observations in Section 4.3 and in prior work [65, 69], we break down 
our model into two parts. The first part (A) models the retention loss at a certain P/E cycle count 
as a logarithmic function of retention time. The second part (B) models how the P/E cycle count 
changes the parameters of retention loss. 
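
Because the model expands to α · PEC · log(t) + β · log(t) + γ · PEC + δ, it is linear in its four constants and can be fitted with ordinary least squares. The following Python sketch illustrates this with a small self-contained solver instead of Statsmodels; the function names, the internal rescaling of PEC (used only to keep the normal equations well conditioned), the natural logarithm, and the synthetic data in the usage note are all assumptions made for this example.

```python
import math

PEC_SCALE = 1000.0  # fit with PEC in thousands of cycles, then rescale


def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_retention_model(samples):
    """Fit y = (alpha*PEC + beta)*log(t) + gamma*PEC + delta by least squares.

    samples: list of (pec, t_seconds, y), where y is log(RBER) or a voltage.
    Returns (alpha, beta, gamma, delta) with PEC expressed in cycles.
    """
    rows = []
    ys = []
    for pec, t, y in samples:
        p = pec / PEC_SCALE
        log_t = math.log(t)
        rows.append([p * log_t, log_t, p, 1.0])
        ys.append(y)
    # Normal equations: (X^T X) theta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(4)]
    theta = solve_linear_system(xtx, xty)
    return theta[0] / PEC_SCALE, theta[1], theta[2] / PEC_SCALE, theta[3]


def predict_retention(pec, t, params):
    """Evaluate the fitted model at a P/E cycle count and retention time."""
    alpha, beta, gamma, delta = params
    return (alpha * pec + beta) * math.log(t) + gamma * pec + delta
```

For instance, generating synthetic points from known constants over P/E cycle counts from 0 to 10K and several retention times, and passing them to `fit_retention_model`, recovers those constants up to numerical error.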

Table 2 shows all of the parameters we use to model the RBER and the threshold voltage as 
a function of the retention time (t) and the P/E cycle count (PEC). In this table, the first column 
shows the modeled variable for each row. The second to fifth columns show the parameters (i.e.,
α, β, γ, and δ) fitted to our model. Note that the model for the optimal Va does not have α and
β parameters because Va is insensitive to retention time. The last column shows the adjusted
coefficient of determination (adjusted R²) of our model. We find that our model achieves high
adjusted R² values for all variables except for σ_ER and Va, meaning that, for these variables, our model explains >89%
of the variation in the characterized data. The adjusted R² values are relatively small for σ_ER and
Va because these two variables do not change much with the retention time or the P/E cycle count.
We conclude that our model is accurate and easy to compute (as it can be computed using simple 
linear regression). Thus, our model is suitable to use online in the SSD controller (for which we 
will describe a mechanism in Section 6.3). 


Model: Variable = (α · PEC + β) · log(t) + γ · PEC + δ

Variable    Symbol          α                 β                 γ                 δ        Adjusted R²
MSB RBER    log(RBER_MSB)   5.49 × 10^[?]     0.16              1.33 × 10^-4      -13.11   97.17%
LSB RBER    log(RBER_LSB)   7.92 × 10^-6      0.25              3.28 × 10^-[?]    -12.72   90.05%
ER Mean     μ_ER            1.01 × 10^-4      0.74              1.52 × 10^-[?]    -27.27   96.86%
P1 Mean     μ_P1            -1.94 × 10^[?]    -0.40             3.51 × 10^-4      114.47   95.88%
P2 Mean     μ_P2            -4.71 × 10^[?]    -0.70             3.23 × 10^-4      189.58   98.50%
P3 Mean     μ_P3            -7.37 × 10^[?]    -1.20             5.75 × 10^-4      264.85   98.29%
ER Stdev    σ_ER            1.20 × 10^-7      -0.10             1.63 × 10^-[?]    17.01    56.33%
P1 Stdev    σ_P1            -1.34 × 10^-6     9.83 × 10^-3      7.55 × 10^-[?]    10.20    93.20%
P2 Stdev    σ_P2            -2.12 × 10^-6     9.85 × 10^-[?]    6.69 × 10^-[?]    10.65    89.02%
P3 Stdev    σ_P3            2.87 × 10^[?]     1.40 × 10^-[?]    3.30 × 10^-[?]    10.83    93.00%
Optimal Va  Va              n/a               n/a               1.20 × 10^-[?]    60.52    71.20%
Optimal Vb  Vb              -3.72 × 10^[?]    -0.57             4.20 × 10^-4      150.56   94.27%
Optimal Vc  Vc              -6.51 × 10^[?]    -1.06             4.81 × 10^-[?]    227.24   97.72%


Table 2. Retention loss model for 3D NAND flash memory and its model parameters. PEC is P/E cycle lifetime, 
t is retention time. 


6 3D NAND ERROR MITIGATION TECHNIQUES 


Motivated by our new findings in Section 4, we aim to design new techniques that mitigate the 
three unique error effects (i.e., layer-to-layer process variation, early retention loss, and retention 
interference) in 3D NAND flash memory. We propose four error mitigation mechanisms. To mitigate 
layer-to-layer process variation, we propose LaVAR and LI-RAID. LaVAR learns our new RBER 
variation model (see Section 5.1) online in the SSD controller, and uses this model to predict 
and apply an optimal read reference voltage that is fine-tuned to each layer (Section 6.1). LI- 
RAID is a new RAID scheme that reduces the RBER variation induced by layer-to-layer process 
variation in 3D NAND flash memory (Section 6.2). To mitigate retention loss in 3D NAND flash 
memory, we propose ReMAR, a new technique that tracks the retention time information within 
the SSD controller and uses our new retention loss model (see Section 5.2) to predict and apply the 
optimal read reference voltage that is fine-tuned to the retention time of the data (Section 6.3). To 
mitigate retention interference, we propose ReNAC, which is adapted from neighbor-cell assisted 
correction (NAC) [16], an existing technique originally designed to reduce program interference in 
planar NAND flash memory, to also account for retention interference in 3D NAND flash memory 
(Section 6.4).


6.1 LaVAR: Layer Variation Aware Reading


In planar NAND flash memory, existing techniques assume that the RBER is the same across all
pages within a flash memory block, and, thus, a single Vopt value can be used for all pages in the
block [6, 76]. This approach is called variation-agnostic Vopt. However, as our results in Section 4.2
show, this assumption no longer holds in 3D NAND flash memory due to layer-to-layer process
variation, as each page in a block resides in a different layer. We aim to improve flash memory
lifetime by mitigating layer-to-layer process variation and reducing the RBER. The key idea is to
identify how much the read reference voltage must be offset by for each layer in a flash chip, to
account for the layer-to-layer process variation, instead of using a single read reference voltage
for the entire block irrespective of layers. When the SSD controller performs a read request, it
accounts for (1) per-block variation in RBER, by predicting a variation-agnostic Vopt based on the
P/E cycle count of the flash block; and (2) layer-to-layer variation, by adding the layer-specific
offset to the variation-agnostic Vopt for the target block. This generates a variation-aware Vopt that
the controller uses as the read reference voltage.

Mechanism. We devise a new mechanism called Layer Variation Aware Reading (LaVAR), which
(1) learns the voltage offsets for each layer and records them in per-chip tables in the SSD controller,
and (2) uses the variation-aware Vopt during a read operation by reading the appropriate voltage
offset for the request from the per-chip table that corresponds to the layer of the request. LaVAR
constructs a model of the optimal read reference voltage (Vopt) variation across different layers.
Since there are only a limited number of layers, this model can be represented as a table (i.e., it
is a non-parametric model) of the offset between the Vopt for each layer (variation-aware Vopt)
and the overall Vopt for the entire flash block (variation-agnostic Vopt). Any previously-proposed
model for Vopt [6, 64, 76] can be used to calculate the variation-agnostic Vopt. Since the layer-to-layer
process variation is similar across blocks and is consistent across P/E cycle counts, the
Vopt variation model can be learned offline for each chip through an extensive characterization
of a single flash block. To do this, the SSD controller randomly picks a flash block and records
the difference between the variation-aware Vopt and the variation-agnostic Vopt. LaVAR uses the
existing read-retry functionality in modern NAND flash memory chips (see Section 4.1) to find
the variation-aware Vopt online. The controller then computes and stores the average Vopt offset
for each layer in a lookup table stored for each chip. Note that Vc variation does not need to be
modeled, since Vc is unaffected by layer-to-layer process variation (see Figure 6 in Section 4.2).

When performing a read operation, the SSD controller simply looks up the Vopt offset that 
corresponds to the layer and the chip that contains the data being read, and adds the offset to the 
per-block Vopt predicted by existing techniques [6, 64, 76]. By using variation-aware Vopt, LaVAR 
enables the use of a more accurate Vopt for 3D NAND flash memory than existing techniques, and 
thus reduces the RBER (see Figure 11 in Section 5.1). 
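
The table-based mechanism described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the controller firmware: the class name `LaVARTable`, its method names, the dictionary keys "Vb" and "Vc", and the choice to round offsets to whole voltage steps are all assumptions made for the example. The variation-agnostic Vopt is taken as an input, since any existing model can supply it.

```python
from statistics import mean


class LaVARTable:
    """Hypothetical per-chip table of layer-specific Vopt offsets (voltage steps)."""

    def __init__(self, num_layers):
        # One signed offset per layer for each modeled read reference voltage.
        self.offsets = {"Vb": [0] * num_layers, "Vc": [0] * num_layers}

    def learn(self, ref, per_layer_vopt, block_vopt):
        """Record offsets from one characterized block.

        per_layer_vopt[layer] holds the variation-aware Vopt values found
        (e.g., via read-retry) for that layer; block_vopt is the
        variation-agnostic Vopt of the same block.
        """
        self.offsets[ref] = [round(mean(vals) - block_vopt) for vals in per_layer_vopt]

    def read_voltage(self, ref, layer, block_vopt):
        """Variation-aware Vopt = variation-agnostic prediction + layer offset."""
        return block_vopt + self.offsets[ref][layer]

    def size_bytes(self):
        # Assuming 1 byte per stored offset, as in the overhead estimate.
        return sum(len(v) for v in self.offsets.values())
```

At read time, only `read_voltage` is on the critical path: one table lookup and one addition per read reference voltage.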

Overhead. LaVAR can be implemented fully in the SSD controller firmware, and, thus, does not 
require any modification to the hardware. Assuming that the 3D NAND flash memory chip has 
N layers and that it takes 1 Byte to store each Vopt offset for each layer, the memory overhead of 
storing the lookup table for Vb and Vc in the SSD controller is 2N Bytes. The latency overhead of 
each read operation is negligible as LaVAR requires only a table lookup and an addition to obtain 
variation-aware Vopt, which take less than 100 ns. Since the lookup table is shared across all blocks 
in a chip, it needs to be learned only once, and it can be constructed gradually in the background. 
Thus, the performance overhead of LaVAR is negligible. 

Evaluation. Figure 12 compares the RBER obtained by using LaVAR (variation-aware Vopt) to that 
obtained by using an existing read reference voltage tuning technique (variation-agnostic 
Vopt) [6, 64, 76] designed for planar NAND flash memory. We evaluate the average RBER obtained by each 
mechanism by simulating read operations using our characterization data in Section 4.2. Averaged 
across all P/E cycle counts, LaVAR reduces the RBER by 43.3%. The benefit comes from tuning 
the read reference voltage towards the variation-aware Vopt by an offset learned by our model. 
The RBER reduction becomes smaller as the P/E cycle count increases, because the overall RBER 
increases exponentially as the NAND flash memory wears out, decreasing the fraction of process 
variation errors. While the flash lifetime improvements produced by LaVAR might seem small (as 
we show in Section 6.5), (1) they are achieved with negligible overhead, and (2) the RBER reduction 
enabled by LaVAR throughout the flash memory lifetime reduces the average flash read latency [6]. 
As the number of layers within a 3D NAND flash memory chip grows (e.g., vendors are already 
bringing chips with 96 layers to the market [1]), we expect that layer-to-layer process variation 
will increase, which in turn will increase the magnitude of the lifetime benefits provided by LaVAR. 

Fig. 12. RBER reduction using LaVAR. 


6.2 LI-RAID: Layer-Interleaved RAID 


As we observe in Section 5.1, even after applying the variation-aware Vopt, the per-page RBER is 
distributed over a wide range according to a fitted gamma distribution due to layer-to-layer process 
variation and MSB-LSB RBER variation (see Figure 5 in Section 4.2). In enterprise SSDs, in addition 
to ECC, the Redundant Array of Independent Disks (RAID) [2, 83] error recovery technique is 
used across multiple flash chips to tolerate chip-to-chip process variation in error rates. RAID in 
modern SSDs typically combines one flash page from each flash chip into a logical unit called a 
RAID group, and uses one of the pages to store the parity information for the entire group. However, 
state-of-the-art RAID schemes do not consider layer-to-layer process variation and MSB-LSB RBER 
variation. These schemes group MSB or LSB pages in the same layer together in a RAID group. As a 
result, the reliability of the SSD is limited by the RBER of the weakest (i.e., the least reliable) RAID 
group that contains the MSB or LSB pages from the least reliable layer across all chips. We devise a 
new RAID scheme called Layer-Interleaved RAID (LI-RAID), which eliminates these low-reliability 
RAID groups by equalizing the RBER among different RAID groups. LI-RAID makes use of two key 
ideas: (1) group flash pages in less reliable layers with pages in more reliable layers, and (2) group 
MSB pages with LSB pages. 

Mechanism. Instead of grouping pages in the same layer together in the same RAID group, 
we select pages from different chips and different layers and group them together, such that the 
low-reliability pages (either due to layer-to-layer process variation or MSB-LSB RBER variation) 
are distributed to different RAID groups. Thus, the new groups formed by LI-RAID have a more 
evenly-distributed RBER than the groups formed using traditional layer-unaware RAID schemes. 
We assume, without loss of generality, that there are m chips in the SSD, and each RAID group 
contains m pages, one from each chip. We also assume that each block contains n wordlines, and 
that the layer numbers of each wordline are in ascending order (e.g., the wordline in layer i has a 
lower wordline number than its neighboring wordline in layer i + 1). Thus, LI-RAID groups together 
the MSB page of wordline 0, the LSB page of wordline $n/m$, the MSB page of wordline $2 \cdot n/m$, the LSB 
page ..., the MSB page of wordline $(m - 2) \cdot n/m$, the LSB page of wordline $(m - 1) \cdot n/m$. Figure 13 
shows an example LI-RAID layout on an SSD with 4 chips and with 4 wordlines within each flash 
block. Flash pages in the same RAID group are highlighted in the same color. In this way, LI-RAID 
distributes the less reliable pages within each chip across different RAID groups, thereby avoiding 
the formation of significantly less reliable RAID groups that bottleneck SSD reliability. 


Wordline#  Layer#  Page  Chip 0   Chip 1   Chip 2   Chip 3

0          0       MSB   Group 0  Blank    Group 4  Group 3
0          0       LSB   Group 1  Blank    Group 5  Group 2
1          1       MSB   Group 2  Group 1  Blank    Group 5
1          1       LSB   Group 3  Group 0  Blank    Group 4
2          2       MSB   Group 4  Group 3  Group 0  Blank
2          2       LSB   Group 5  Group 2  Group 1  Blank
3          3       MSB   Blank    Group 5  Group 2  Group 1
3          3       LSB   Blank    Group 4  Group 3  Group 0


Fig. 13. LI-RAID layout example for an SSD with 4 chips and with 4 wordlines in each flash block. 
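
The layout in Figure 13 can be reproduced programmatically. The sketch below is a reconstruction inferred from the figure, not the authors' implementation: the rotation rule (chip k starts Group 0 at wordline k · n/m and wraps around), the MSB/LSB swap on odd-numbered chips, and the function names `li_raid_layout` and `group_members` are assumptions. It matches Figure 13 for m = 4 and n = 4; its behavior for other sizes is an extrapolation, and it does not model the order in which wordlines are programmed.

```python
def li_raid_layout(num_chips, num_wordlines):
    """Return layout[chip][wordline] = (msb_group, lsb_group), or None if blank.

    Assumes num_wordlines is a multiple of num_chips.
    """
    stride = num_wordlines // num_chips   # n/m in the text
    pairs = num_wordlines - 1             # one wordline per block is left blank
    layout = [[None] * num_wordlines for _ in range(num_chips)]
    for chip in range(num_chips):
        for p in range(pairs):
            wl = (chip * stride + p) % num_wordlines
            g_even, g_odd = 2 * p, 2 * p + 1
            # Even chips store the even-numbered group in the MSB page; odd
            # chips swap, so each group mixes MSB and LSB pages.
            layout[chip][wl] = (g_even, g_odd) if chip % 2 == 0 else (g_odd, g_even)
    return layout


def group_members(layout, group):
    """List (chip, wordline, page_type) for every page in one RAID group."""
    members = []
    for chip, wordlines in enumerate(layout):
        for wl, cell in enumerate(wordlines):
            if cell is not None and group in cell:
                members.append((chip, wl, "MSB" if cell[0] == group else "LSB"))
    return members
```

For the 4-chip, 4-wordline case, `group_members(li_raid_layout(4, 4), 0)` yields one page per chip, spread over all four layers and alternating between MSB and LSB pages.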


Note that, since the order of RAID group number is different in each flash chip, the LI-RAID 
layout may potentially violate the program sequence recommended by flash vendors, where 
wordlines within each flash block must be programmed in order to minimize harmful program 
interference [9, 15, 16, 77]. For example, in Chip 2 in Figure 13, Wordline 3 (Groups 2 and 3) is 
programmed after Wordline 2 (Groups 0 and 1). In Chip 2, we leave Wordline 1 blank (marked 
as “Blank” in Figure 13). Otherwise, Wordline 1 would cause program interference to the data in 
Wordline 2, which already experiences program interference when Wordline 3 is programmed, 
significantly increasing the error rate of Wordline 2 [15, 16] (see Appendix A.1.3). By laying out the 
data in the proposed manner, LI-RAID provides the same reliability guarantee as the recommended 
program sequence, by guaranteeing that any data stored in a flash page experiences program 
interference from at most one neighboring wordline. 

Overhead. The grouping of flash pages by LI-RAID is implemented entirely in the SSD controller 
firmware. This requires the firmware to be aware of the physical-page-to-layer mapping. The flash 
pages left blank in LI-RAID incur a small additional storage overhead compared to a conventional 
RAID scheme. Only one wordline (i.e., two pages in MLC NAND flash memory) within a flash 
block is left blank, to mitigate the impact of program interference on Groups 0 and 1. Without 
this blank wordline, the data in Groups 0 and 1 would be the only data to experience program 
interference twice: once when Groups 2 and 3 are programmed, and once when the last two groups 
are programmed. In modern NAND flash memory, each flash block typically contains at least 256 
flash pages. Thus, the additional storage overhead for the blank pages is less than 0.8%. LI-RAID 
does not incur additional computational overhead because it computes parity in the same way as 
a conventional RAID scheme, and only reorganizes the RAID groups differently. Because we do 
not change the data layout across flash blocks, the flash translation layer (FTL) and the garbage 
collection (GC) algorithms remain the same as in a conventional RAID scheme. 
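
As a quick check of the storage overhead bound stated above, assuming exactly 256 pages per block (the minimum the text mentions) and two blank pages per block:

```latex
\[
\frac{2\ \text{blank pages}}{256\ \text{pages per block}} = 0.0078125 \approx 0.78\% < 0.8\%
\]
```

Blocks with more than 256 pages would have a proportionally smaller overhead.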

Evaluation. Figure 14 plots the worst-case RBER (i.e., the highest per-page RBER within a flash 
block) when we use different error mitigation techniques at 10,000 P/E cycles. Recall that the 
per-page RBER within a flash block follows a gamma distribution (see Figure 11 in Section 5.1). 
Thus, several least-reliable flash pages within a block may become unusable (i.e., their RBER exceeds 
the ECC correction capability) before the overall RBER of the flash chip exceeds the ECC correction 
capability. We use the worst-case RBER to represent the reliability of these least-reliable flash 
pages. In this figure, the baseline uses the per-block variation-agnostic optimal read reference 
voltage (i.e., variation-agnostic Vopt), achieving a worst-case RBER of $4.8 \times 10^{-4}$. When we use the 
variation-aware Vopt proposed in Section 6.1, the worst-case RBER is reduced by 9.6% over the 
baseline, to $4.3 \times 10^{-4}$. LI-RAID reduces the worst-case RBER by 66.9% over the baseline, to only 
$1.6 \times 10^{-4}$. Thus, by grouping flash pages on less reliable layers with pages on more reliable layers, 
and by grouping MSB pages with LSB pages, LI-RAID reduces the probability of unusable pages 
within a block, thereby reducing the number of retired flash blocks due to ECC failures. 


[Figure 14: bar chart of worst-case RBER (x-axis: 0 to 0.0005) for Baseline (variation-agnostic Vopt), LaVAR (variation-aware Vopt), and LaVAR+LI-RAID.]


Fig. 14. Effect of LaVAR and LI-RAID on worst-case RBER at 10,000 P/E cycles. 


Note that LaVAR and LI-RAID do not rely on whether the RBER variation is consistent across all 
chips. LaVAR learns a different lookup table for each chip. Thus, even if some chip-to-chip 
process variation is present, our models can dynamically capture the behavior of 
any NAND flash memory chip. Conventional RAID tolerates only chip-to-chip process variation. 
LI-RAID improves flash reliability over conventional RAID by eliminating the strong correlation 
between RBER and layer number, which we show in Figure 5. We conclude that both LaVAR and 
LI-RAID are effective at reducing the impact of layer-to-layer variation on the RBER. 


6.3 ReMAR: Retention Model Aware Reading 


As we show in Section 4.3, due to early retention loss, retention errors increase much faster after 
programming a page in 3D NAND flash memory than they do in planar NAND flash memory. 
Thus, mitigating retention errors has become more important in 3D NAND than in planar NAND 
flash memory, as the errors have a greater impact on SSD reliability. However, as we show in our 
model in Section 5.2, the RBER impact of early retention loss is proportional to the logarithm of 
retention time. This means that a large majority of the retention errors and threshold voltage shifts 
happen shortly after programming. As a result, traditional retention error mitigation techniques 
developed for planar NAND flash memory, which are optimized for much larger retention times, 
may become less effective on 3D NAND flash memory. For example, Flash Correct-and-Refresh 
(FCR) [7, 8], a mechanism that remaps all data periodically, allows planar NAND to tolerate 50x 
more P/E cycles with a 3-day refresh period. However, according to our evaluations, the P/E cycle 
lifetime improvement of FCR reduces to only 2.7x for 3D NAND flash memory due to the early 
retention loss phenomenon. This motivates us to explore new ways to mitigate retention errors in 
3D NAND flash memory. 

Mechanism. We propose a new mechanism called Retention Model Aware Reading (ReMAR), 
whose key idea is to accurately track the retention time of the data and apply the optimal read 
reference voltage predicted by our model in Section 5.2. First, ReMAR constructs the same linear 
models proposed in Section 5.2 online to accurately predict the optimal Va, Vb, and Vc. Similar 
to the distribution parameter model used in Section 5.2, we model the optimal Vb and Vc as: 
$V = (\alpha \cdot PEC + \beta) \cdot \log(t) + \gamma \cdot PEC + \delta$. We model the optimal Va as: $V_a = \gamma \cdot PEC + \delta$, since Va 
is not affected by retention time (as we show empirically in Section 4.3). To construct this model 
online, the controller randomly selects a flash block and records the optimal read reference voltage 
of the block (which the controller learns by sweeping the read reference voltages, as done in prior 
work [6]), along with the block’s P/E cycle count (PEC) and retention time (t). Over time, these data 
samples would cover a range of P/E cycle counts and retention times.⁸ Note that as the P/E cycle 
count of the SSD increases, the accuracy of the model increases, because more data samples are 
collected. Once this online model is constructed, it is used in the controller to predict the optimal 
read reference voltage to be used for each read operation. To do this, the SSD controller stores 
the P/E cycle count and the program time of each block as metadata. During each read operation, 
the controller computes the retention time for each read by subtracting the program time from 
the read time. Using the recorded P/E cycle count and the computed retention time of the data, 
ReMAR applies the online model to predict Va, Vb, and Vc. By accurately predicting and applying 
the optimal read reference voltages, ReMAR increases the accuracy of read operations and thereby 
decreases the raw bit error rate. 
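
The model construction and per-read prediction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_remar_model` and `predict_vopt` are hypothetical, the fit uses ordinary least squares over the collected samples (the text does not specify a fitting method), PEC is scaled to thousands of cycles purely for numerical conditioning, and times are assumed to be in seconds. The sketch covers the four-parameter form used for Vb and Vc; Va would use only the PEC-dependent terms.

```python
import math


def fit_remar_model(samples):
    """Least-squares fit of V = (a*PEC + b)*log(t) + c*PEC + d.

    samples: list of (pec, retention_seconds, v_opt) tuples.
    Returns (a, b, c, d), with PEC expressed in thousands of cycles.
    """
    rows = []
    for pec, t, v in samples:
        p, lt = pec / 1000.0, math.log(t)
        rows.append(([p * lt, lt, p, 1.0], v))
    n = 4
    # Normal equations (A^T A) w = A^T y, stored as an augmented matrix.
    m = [[sum(x[i] * x[j] for x, _ in rows) for j in range(n)]
         + [sum(x[i] * y for x, y in rows)] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * w for u, w in zip(m[r], m[col])]
    return tuple(m[i][n] / m[i][i] for i in range(n))


def predict_vopt(coeffs, pec, program_time, read_time):
    """Predict Vopt for a read from the block's stored P/E count and program time."""
    a, b, c, d = coeffs
    t = max(read_time - program_time, 1)  # retention time in seconds
    p = pec / 1000.0
    return (a * p + b) * math.log(t) + c * p + d
```

The fit needs samples that span at least two distinct P/E cycle counts and two distinct retention times; otherwise the four coefficients cannot be separated.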

Overhead. Like LaVAR, ReMAR is implemented fully in the SSD controller firmware, and does 
not require any modifications to the hardware. Assuming that the flash block size is 5 MB, and 
that ReMAR stores the program time in the UNIX Epoch time format [67], which takes up 4 B, the 
memory and storage overhead of ReMAR is 800 KB for a 1 TB SSD. The performance overhead of 
each read operation is small, as ReMAR needs only a few dozen CPU cycles (on the order of 100 ns in 
total) in the SSD controller to compute Vopt, which is negligible compared to flash read latency (on 
the order of 10 µs). The performance overhead of learning the model can be hidden by (1) performing 
learning in the background and (2) deprioritizing the requests issued for characterization purposes. 
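
The memory overhead figure above follows from simple arithmetic, assuming decimal units (1 TB = $10^{12}$ B, 5 MB = $5 \times 10^{6}$ B) and one 4 B timestamp per block:

```latex
\[
\frac{10^{12}\ \text{B}}{5 \times 10^{6}\ \text{B per block}} = 2 \times 10^{5}\ \text{blocks},
\qquad
2 \times 10^{5} \times 4\ \text{B} = 8 \times 10^{5}\ \text{B} = 800\ \text{KB}
\]
```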

The controller uses the UNIX Epoch time format [67] for program and read times, such that the 
recorded time is valid after reboot. To do this, the controller needs a real-time clock to keep track 
of the current time. Without a power source on the SSD, the controller needs a special command to 
synchronize the current time with the host when it boots up. The program time of each block is 
stored in the memory of the controller, along with other metadata that already exists such as the 
logical address map and the P/E cycle count of each block. 

Evaluation. Figure 15 compares the RBER achieved by ReMAR to that of the state-of-the-art 
read reference voltage tuning technique [64] designed for planar NAND flash memory (Baseline). 
The results are based on the characterization data in Section 4.3. We assume that the average 
retention time of the data is 24 days. The Baseline technique is unaware of the retention time. Thus, 
Baseline uses a retention-agnostic Vopt based on only the P/E cycle count of the flash page. ReMAR 
uses a retention-aware Vopt based on both the P/E cycle count and the retention time of the flash 
page. On average across all P/E cycle counts, ReMAR reduces the RBER by 51.9%. As the P/E cycle 
count increases, the benefit of ReMAR (i.e., the RBER improvement of ReMAR over Baseline) also 
increases. We conclude that, by accurately tracking retention time, and by using our retention loss 
model, ReMAR accurately adapts the read reference voltage to the threshold voltage shifts that 
occur due to retention loss, and hence it effectively reduces the RBER. 


Fig. 15. RBER reduction using ReMAR. 

⁸ The SSD controller can also perform additional characterization if a certain data range is missing. 


6.4 ReNAC: Retention Interference Aware Neighbor-Cell Assisted Correction 


As we observe in Section 4.4, due to retention interference, the amount of threshold voltage shift 
of a victim cell during a certain amount of retention time is affected by the value stored in a 
vertically-adjacent neighbor cell. This phenomenon presents a similar data dependency as that 
induced by program interference, where the amount of the threshold voltage shift of a victim 
cell during programming operation also depends on the value stored in the directly-neighboring 
cells [15, 16]. To mitigate program interference errors, prior work proposes neighbor-cell assisted 
correction (NAC) [16]. The goal of NAC is to reduce the raw bit error rate by reading each cell at 
the read reference voltage optimized for the amount of program interference induced by its directly- 
neighboring cells. To achieve this goal, after error correction fails on a flash page, NAC reads the 
data stored in the neighboring wordline and re-reads the failed page using a set of read reference 
voltage values that are adjusted based on the data values stored in the directly-neighboring cells [16]. 
However, this mechanism does not account for retention interference induced by the neighboring 
cells, which is new in 3D NAND flash memory. We adapt NAC for 3D NAND flash memory to 
account for the new retention interference phenomenon, and call this adapted mechanism Retention 
Interference Aware Neighbor-Cell Assisted Correction (ReNAC). 

Mechanism. The key idea of ReNAC is to use the data stored in a vertically-adjacent neighbor 
cell to predict the amount of retention interference on a victim cell. Using similar techniques from 
Section 5.2, ReNAC first develops an online model of retention interference as a function of the 
retention time and the neighbor cell’s state. The SSD controller obtains the retention time of each 
block using a mechanism similar to ReMAR, and computes and applies the neighbor-cell-dependent 
read offset at that retention time from the model. For ReNAC, we are currently unable to show any 
meaningful improvements in flash lifetime for the current generation of 3D NAND flash memory, 
because retention interference shifts the threshold voltage by less than two voltage steps 
(Figure 10), which is much smaller than the voltage changes due to process variation (Figure 6) 
and early retention loss (Figure 8). However, we expect that retention interference will increase 
in future 3D NAND flash memory devices due to decreasing cell sizes and decreasing distances 
between neighboring cells (Table 1), which, in turn, will likely increase the benefit of using ReNAC. 
We also expect ReNAC to have a relatively larger benefit in 3D NAND flash memory chips that 
use triple-level cell (TLC) or quadruple-level cell (QLC) technologies. A TLC or QLC NAND flash 
memory chip stores more bits in a cell than an MLC NAND flash memory chip, by splitting up 
the same voltage range into a greater number of states (eight for TLC and sixteen for QLC). Doing 
so reduces the voltage margin between neighboring threshold voltage distributions. Therefore, 
shifting the read reference voltage by two voltage steps may affect more cells in TLC and QLC 3D 
NAND flash memory than in MLC 3D NAND flash memory, and, thus, ReNAC can reduce a greater 
number of raw bit errors in future TLC or QLC NAND flash memory. We leave a quantitative 
evaluation of ReNAC on future 3D NAND flash memory chips to future work. 
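
To make the ReNAC idea concrete, the following sketch shows one way a neighbor-dependent re-read could be organized. It is purely illustrative: the text does not give the form of the retention interference model, so the log-linear offset `k[state] * log(t)`, the function names, and the `read_page` callback (standing in for a page read at a given read reference voltage) are all assumptions.

```python
import math


def retention_offsets(coeff_per_state, retention_s):
    """Hypothetical model: read offset for neighbor state s is k[s] * log(t)."""
    log_t = math.log(max(retention_s, 1))
    return {s: k * log_t for s, k in coeff_per_state.items()}


def renac_read(read_page, neighbor_states, base_vref, offsets):
    """Re-read a failed page once per neighbor state and merge the results.

    read_page(vref) -> list of bits (1 if a cell's threshold voltage exceeds vref).
    neighbor_states[i] is the state of the vertically-adjacent neighbor of cell i.
    """
    reads = {s: read_page(base_vref + off) for s, off in offsets.items()}
    # For each cell, keep the bit from the read tuned to its neighbor's state.
    return [reads[s][i] for i, s in enumerate(neighbor_states)]
```

As in NAC, the extra reads would be issued only after error correction fails on a page, so the common-case read latency is unaffected.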


6.5 Putting It All Together: Effect on System Reliability and Performance 


The mechanisms we propose in this section can be combined together to achieve significant 
reductions in average and worst-case RBER. For a consumer-class 3D NAND flash memory device, 
these reductions improve flash memory lifetime, i.e., the device can tolerate more P/E cycles before 
failing. For an enterprise-class device which is expected to be replaced after a fixed amount of 
time, these reductions improve the sustainable workload write intensity or reduce the ECC storage 
overhead. We evaluate these potential effects of our mechanisms on storage system reliability and 
performance. 

Flash Lifetime (or Performance) Improvement. In Figure 16, we compare and contrast the 
reliability (i.e., the RBER) of five example SSDs: (1) Baseline, an SSD that uses a fixed, default read 
reference voltage and employs a conventional RAID scheme; (2) State-of-the-art, an SSD that uses 
the optimal read reference voltage predicted by existing mechanisms designed for planar NAND 
flash memory [6, 64, 76, 81] and employs a conventional RAID scheme; (3) LaVAR, an SSD that uses 
the optimal read reference voltage for each layer predicted by LaVAR in addition to State-of-the-art; 
(4) LaVAR+LI-RAID, an SSD that uses the LI-RAID scheme in addition to LaVAR; and (5) This Work 
(LaVAR + LI-RAID + ReMAR), an SSD that uses the optimal read reference voltage predicted by 
LaVAR and ReMAR, and also employs the LI-RAID scheme. In this figure, we plot the worst-case 
RBER (i.e., the highest per-page RBER within a flash block) instead of the average RBER, because 
the worst-case RBER limits the flash memory lifetime. Because RBER increases with P/E cycle 
count, if the worst-case RAID group has a high enough worst-case RBER, NAND flash memory can 
no longer guarantee reliable operation. 

Fig. 16. Effect of LaVAR, LI-RAID, and ReMAR on worst-case RBER experienced by any flash block. 


Assuming that the ECC deployed on the SSD can correct errors up to an RBER of $3 \times 10^{-3}$ [6, 9] 
(i.e., the ECC limit, shown as a purple dashed line in Figure 16), we can calculate the lifetime of 
each SSD we evaluate.⁹ In our evaluations, the flash memory lifetime ends when the worst-case 
RBER exceeds the ECC limit. We find that State-of-the-art, LaVAR, LaVAR+LI-RAID, and This Work 
improve flash memory lifetime by 23.8%, 25.3%, 57.2%, and 85.0%, respectively, over the Baseline. 
When the SSD is used in a server, which has a fixed device lifetime, the server has to throttle 
the write frequency to a certain drive writes per day (DWPD) to ensure that the SSD can operate 
reliably during the fixed lifetime. In this case, our combined mechanisms (This Work) increase the 
maximum write frequency (i.e., the maximum DWPD) of the SSDs in a server by 85.0%. Thus, our 
mechanisms either improve lifetime or improve performance under a fixed lifetime. 

⁹ Note that we are unable to directly measure the flash lifetime improvements on real devices, because manufacturers do not 
provide us with the ability to modify the SSD firmware directly, which prevents us from evaluating our techniques on the 
real devices themselves. Unfortunately, we also do not have the resources to measure the lifetime of a large number of real 
flash chips by emulating the behavior of our mechanisms, as this would require many additional months to years of effort. 
Instead, we follow the precedent of prior work to evaluate the flash memory lifetime based on real RBER characterization 
data we obtain from the testing of real flash memory devices. 
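
The lifetime definition used here (the P/E cycle count at which the worst-case RBER reaches the ECC limit) can be sketched as a small helper. This is an assumed procedure for illustration, not the evaluation code behind Figure 16: the function names are hypothetical, and interpolating linearly in log(RBER) between measured points is a modeling choice motivated by the roughly exponential growth of RBER with wearout.

```python
import math


def lifetime_pec(curve, ecc_limit):
    """P/E cycle count at which the worst-case RBER first reaches ecc_limit.

    curve: list of (pec, worst_case_rber) samples sorted by pec.
    Interpolates linearly in log(RBER) between neighboring samples and
    returns None if the limit is not crossed within the sampled range.
    """
    for (p0, r0), (p1, r1) in zip(curve, curve[1:]):
        if r0 < ecc_limit <= r1:
            frac = (math.log(ecc_limit) - math.log(r0)) / (math.log(r1) - math.log(r0))
            return p0 + frac * (p1 - p0)
    return None


def improvement(baseline_pec, new_pec):
    """Relative lifetime gain, e.g. 0.85 for an 85% improvement."""
    return new_pec / baseline_pec - 1.0
```

Under a fixed device lifetime, the same ratio bounds how much the sustainable write rate (DWPD) can increase, which is why the lifetime and DWPD improvements reported above are equal.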

ECC Storage Overhead Reduction. In modern SSDs, the storage overhead for error correction 
increases in each generation to better tolerate the degraded flash reliability due to aggressive 
scaling. For example, to tolerate an RBER of up to $3 \times 10^{-3}$ for the Baseline SSD at the end of 
its lifetime, a modern BCH code [36] requires 12.8% storage overhead for the redundant ECC 
bits [25] (i.e., ECC redundancy). By deploying all of our proposed error mitigation techniques in 
an enterprise-class SSD, the RBER at the end of the fixed flash memory lifetime is significantly 
lower compared to Baseline. Thus, we can redesign the ECC deployed in the SSD to tolerate only 
up to the reduced RBER, which requires fewer ECC bits and, thus, lower ECC redundancy than the 
ECC required for the Baseline. Assuming all five of the evaluated SSDs achieve the same lifetime, 
and the same reliability (i.e., uncorrectable error rate) at the end of their lifetime, State-of-the-art, 
LaVAR, LaVAR+LI-RAID, and This Work reduce ECC redundancy by 42.2%, 45.3%, 68.8%, and 78.9%, 
respectively, over Baseline. We leave the evaluation of the performance improvements due to a 
weaker ECC requirement [22, 59] for future work. 

We conclude that by combining LaVAR, LI-RAID, and ReMAR, we can (1) achieve significant 
improvements in the lifetime of 3D NAND flash memory, (2) enable higher write intensity in 
workloads within a given lifetime requirement, or (3) keep the lifetime constant but greatly reduce 
the storage cost of reliability in 3D NAND flash memory. 


7 RELATED WORK 


To our knowledge, this paper is the first in open literature to (1) show the differences between the 
error characteristics of 3D NAND flash memory and those of planar NAND flash memory through 
extensive characterization using real 3D NAND flash memory chips, (2) develop models of layer-to- 
layer process variation and early retention loss for 3D NAND flash memory, and (3) propose and 
show the benefits of four new mechanisms based on the new error characteristics of 3D NAND 
flash memory. Due to the importance of NAND flash memory reliability in storage systems, there 
is a large body of related work. We treat this related work in six different categories. 

3D NAND Flash Memory Error Characterization. Two recent works compare the retention 
loss phenomenon between 3D NAND and planar NAND flash memory [65, 70] through real 
device characterization, and report findings similar to our work regarding the early retention 
loss phenomenon. Two other recent works use a methodology similar to ours to characterize 3D 
NAND devices based on different 3D NAND flash memory cell technologies (i.e., 3D floating-gate 
cell and 3D vertical gate cell) [38, 94, 95], which are less common than the 3D charge trap NAND 
flash memory cell technology that we test in this paper. Other recent works [23, 31, 78, 80, 92] 
report several differences of 3D NAND flash memory from planar NAND flash memory. These 
differences include (1) smaller program variation at high P/E cycle counts [80], (2) smaller program 
interference [80], (3) layer-to-layer process variation [92], (4) early retention loss [23, 31, 78], 
and (5) retention interference [23]. While prior works have reported on the existence of these 
errors, none of them provide a comprehensive characterization of all of the different errors using 
the same chips. Only one of these prior works [23] provides a detailed analysis based on circuit- 
level measurements and characterizations, and does so only for early retention loss and retention 
interference. Other works provide only a high-level summary of real device characterization [80] 
or do not provide any real device characterization results at all [31, 78, 92]. Our work performs an 
extensive detailed analysis of all known sources of error in 3D NAND flash memory chips, which 
allows us to understand the relative impact of each error source on the same chip. We report the 
first set of extensive results on three error characteristics that are new in 3D NAND flash memory: 
layer-to-layer process variation, early retention loss, and retention interference. 

Planar NAND Flash Memory Error Characterization. A large body of prior work studies 
all types of error sources on planar NAND flash memory, including P/E cycling errors [9, 14, 64, 
81], programming errors [4, 64, 81], cell-to-cell program interference errors [15, 16], retention 
errors [6, 7, 9, 28], and read disturb errors [5, 9]. These works characterize how the raw bit error 
rate and threshold voltage change due to various types of error sources. A detailed survey of such 
prior works on planar NAND flash memory can be found in our recent survey articles [9, 11]. Our 
paper experimentally studies all of these error mechanisms in the new 3D NAND flash memory 
context, and compares 3D NAND flash memory error characteristics with results in these prior 
works to show the differences between 3D NAND and planar NAND flash memory. Prior work 
demonstrates the early retention loss phenomenon in planar NAND flash memory based on charge 
trap transistors [21], which is similar to, but not as severe as, the early retention loss phenomenon 
in 3D NAND flash memory. We investigate retention interference and process variation related 
errors, in addition to these other error types discovered before in planar NAND flash memory. 

Planar NAND Error Modeling and Mitigation. Based on characterization results, prior work 
proposes models for planar NAND flash memory threshold voltage distribution, and models for 
estimating the effect of P/E cycling on the threshold voltage distribution [14, 64, 81]. Our work 
uses a simpler threshold voltage distribution model, since more complex models are designed to 
handle programming errors in planar NAND flash memory that do not exist in the 3D NAND 
flash memory chips that we test. We develop a unified model of retention loss and wearout for 
the RBER, threshold voltage distribution, and Vopt in 3D NAND flash memory. There is a large 
body of prior work that proposes mechanisms to mitigate planar NAND flash memory errors [4- 
9, 11, 15, 16, 32, 33, 37, 40, 41, 60, 63, 64, 74, 75, 93, 98]. In Section 6, we have already compared our 
mechanisms to several of these techniques that are state-of-the-art, and have shown that prior 
techniques developed for planar NAND flash memory are less effective in 3D NAND flash memory 
than our techniques due to the new error characteristics of 3D NAND flash memory. 

3D NAND Flash Memory Error Mitigation. Prior work proposes circuit-level and system- 
level techniques to tolerate layer-to-layer process variation in 3D NAND flash memory. Two recent 
works propose to use different read reference voltages for different layers [38, 96], which is similar 
to the LaVAR technique that we propose in Section 6.1. Unlike our work, these prior works do 
not (1) design a detailed mechanism like LaVAR to learn and use the Vopt in a lookup table, or 
(2) evaluate their techniques using real characterization data. Wang et al. propose to apply different 
read reference voltages for less-reliable pages storing critical metadata [92]. As we have shown in 
Section 6.1, while these prior techniques improve average RBER, they do not significantly reduce 
worst-case RBER, which limits the flash memory lifetime. In this work, we propose a series of 
mitigation techniques that not only significantly reduce the average and worst-case RBER but also 
tolerate other new error characteristics we find in 3D NAND flash memory, such as early retention 
loss and retention interference. 

Large-Scale SSD Error Characterization. Prior work performs large-scale studies of errors 
found in flash memories deployed in data centers [68, 72, 87]. Since the operating system is unaware 
of the raw bit errors in the NAND flash memory devices, these studies can only use drive-level 
statistics provided by the SSD controller, such as overall RBER and uncorrectable error rate, average 
P/E cycle count, and a coarse estimation of retention time and read disturb counts. In contrast, 
in our studies, we have complete access to the physical location, P/E cycle count, retention time, 
and read disturb count of each read/write operation, and thus can provide deeper insights and 
controlled experimental results compared to large-scale studies, which have to be correlational in 
nature. 

DRAM Error Characterization. Like a flash memory cell, a DRAM cell stores charge to represent
a piece of data. Hence, DRAM has many error characteristics that are similar to NAND flash
memory. For example, charge leaks from a DRAM cell over time, at a speed much faster than that 
for NAND flash memory (i.e., on the order of milliseconds to seconds in DRAM [61, 62]), leading to 
data retention errors. This phenomenon in DRAM is analogous to the retention loss phenomenon in 
NAND flash memory (see Section 4.3 and Appendix A.2), and its effect has been studied through 
extensive experimental characterization of DRAM chips [34, 35, 44, 46-49, 51, 56, 61, 82, 85]. Similar 
to the retention interference phenomenon found in 3D NAND flash memory (see Section 4.4), 
DRAM exhibits data-dependent retention behavior, or data pattern dependence (DPD) [61], where 
the retention time of a DRAM cell is dependent on the values written to nearby DRAM cells
[46-49, 61, 82]. Conceptually similar to the read disturb errors found in NAND flash memory (see
Appendix A.3.2), commodity DRAM chips that are sold and used in the field today exhibit read 
disturb errors [52], also called RowHammer-induced errors [71]. These errors are affected by process 
variation, which we comprehensively examine in 3D NAND flash memory (see Section 4.2 and 
Appendix A.4). Process variation in DRAM is shown to also affect access latency, retention time, 
and power consumption [17-20, 30, 34, 35, 43, 44, 46-49, 51, 54-56, 61, 62, 66, 82, 85]. 


8 CONCLUSION 


We develop a new understanding of three new error characteristics in 3D NAND flash memory 
through rigorous experimental characterization of real, state-of-the-art 3D NAND flash memory 
chips: layer-to-layer process variation, early retention loss, and retention interference. We analyze 
and show that these new error characteristics are fundamentally caused by changes introduced in 
the 3D NAND flash memory architecture compared to the planar NAND flash memory architecture. 
To handle these three new error characteristics in 3D NAND flash memory, we develop new 
analytical models for layer-to-layer process variation and early retention loss in 3D NAND flash 
memory. Our models can accurately predict/estimate the optimal read reference voltage and the 
raw bit error rate based on the retention time and the layer number of each flash memory page. 
We propose four new error mitigation techniques that utilize our new models to improve the 
reliability of 3D NAND flash memory. Our evaluations show that our newly-proposed techniques 
successfully mitigate the new error patterns that we discover in 3D NAND flash memory. We hope 
that the rigorous and comprehensive error characterization and analyses performed in this work 
motivate future rigorous studies on 3D NAND flash memory reliability, and that they inspire new 
error mitigation mechanisms that cater to the new error characteristics found in 3D NAND flash 
memory. 

A APPENDIX
A.1 Write-Induced Errors 


We analyze how each type of write-induced error affects the RBER and the threshold voltage 
distribution of 3D NAND flash memory. 


A.1.1 Program Errors. Program errors occur when the data is incorrectly written to the NAND 
flash memory [4, 9, 11, 79]. Such errors are introduced when multiple programming operations 
are required to write data to a single cell. For example, in many MLC NAND flash memory 
devices, two-step programming [4, 79] is employed. Two-step programming uses two separate 
partial programming steps to write data to an MLC NAND flash cell. In the first step, the flash 
controller writes only the LSB to the cell, setting the cell to a temporary voltage state. In the second 
step, the controller writes the MSB to the cell, but in order to perform this write, the controller must 
first determine the current voltage state of the cell. This requires reading the partially-programmed 
data from the cell, during which an error may occur. This error causes the controller to incorrectly 
set the final voltage state of the cell during the second programming step, and, thus, is called a 
program error. Prior work [4] shows that program errors occur in state-of-the-art planar MLC 
NAND flash memory. 

Current generations of 3D NAND flash memory use one-shot programming [4, 9, 11, 79], which 
programs both the LSB and MSB of a cell at the same time. As a result, current 3D NAND flash 
memory devices do not experience program errors. Our measurements in Figure 4 confirm the lack 
of program errors in 3D NAND flash memory. In an MLC NAND flash memory that has program 
errors, the threshold voltage distributions of the ER and P1 states have secondary peaks near the P2 
and P3 states, respectively [4]. This is because program errors affect only the LSB, since only the 
LSB is being read during the second programming step. Since there is no second peak in Figure 4, 
there are no program errors. 

Program errors may appear in future 3D NAND flash memory devices. In planar NAND flash
memory, two-step programming was introduced when planar MLC NAND flash memory transitioned
to the 40 nm manufacturing process technology node, in order to reduce the number of program 
interference errors [79]. A similar transition may occur in the future to continue scaling the density 
of 3D NAND flash memory, especially as it becomes increasingly difficult to add more layers into a 
3D NAND flash memory chip. Thus, we conclude that today’s 3D NAND flash memories do not 
have program errors, but program errors may appear in future generations. 


A.1.2 Program/Erase Cycling Errors. A P/E cycling error occurs because of the natural variation 
of the threshold voltage of cells in each state [14, 69] due to the inaccuracy of each program and 
erase operation (see Section 2.2). Such inaccuracy during program and erase operations increases 
as the P/E cycle count increases. To study the impact of P/E cycling errors, we randomly select a 
flash block within each 3D NAND chip, and wear out the block by programming random data to 
each page in the block until the block reaches 16K P/E cycles. Using the methodology described in 
Section 4.1, we obtain the overall RBER and the threshold voltage of each cell at various P/E cycle 
counts.¹⁰

¹⁰ Due to limitations with our experimental testing platform, each data point at a particular P/E cycle count has a retention
time of 50 minutes.

Observations. Figure 17 shows how the mean and standard deviation of the threshold voltage
distribution of each state change as a function of the P/E cycle count, when we fit our voltage
measurements for each state to a Gaussian model. Each subfigure in the top row represents the
mean for a different state; each subfigure in the bottom row represents the standard deviation for
a different state. The blue dots show the measured data; each orange line shows a linear trend
fitted to the measured data. The x-axis shows the P/E cycle count; the y-axis shows the mean
(Figures 17a-17d) or the standard deviation (Figures 17e-17h) of the threshold voltage distribution
of each state, in voltage steps. We make four observations from Figure 17. First, the mean and
standard deviation of all states increase linearly as the P/E cycle count increases. We fit a line
using linear regression, shown as an orange dotted line in each subfigure.¹¹ Second, the threshold
voltage distributions of the ER and P1 states shift to higher voltages, while the distributions of the
P2 and P3 states shift to lower voltages, causing the distributions to move closer to the middle of
the threshold voltage range. Third, the threshold voltage distributions of all four states become
wider (i.e., the standard deviation increases) as the P/E cycle count increases. Since the distributions
shift towards the middle of the threshold voltage range and become wider as the P/E cycle count
increases, the distributions become closer to each other, which increases the raw bit error rate.
Fourth, the magnitude of the threshold voltage shift and the widening of the distributions is much
larger for the ER state than it is for the other three states (i.e., P1, P2, P3). Therefore, ER ↔ P1 errors
(i.e., an error that shifts a cell that is originally programmed in the ER state to the P1 state, or vice
versa) increase faster than other errors with the P/E cycle count.

Fig. 17. Mean and standard deviation of our Gaussian threshold voltage distribution model of each state,
versus P/E cycle count. 
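
The fitting procedure behind Figure 17 can be sketched in a few lines: summarize each state's threshold voltages with a Gaussian (mean and standard deviation) at every P/E cycle count, then fit a line to each parameter. The sketch below is only an illustration under stated assumptions. The function names `fit_gaussian` and `fit_line`, the synthetic ER-state data, and every numeric constant are hypothetical and are not taken from our measurements.

```python
import random
import statistics


def fit_gaussian(voltages):
    """Return (mean, standard deviation) of one state's threshold voltages."""
    return statistics.fmean(voltages), statistics.pstdev(voltages)


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic (hypothetical) ER-state data: the mean drifts upward and the
# spread widens with wearout. These numbers are NOT measurements.
rng = random.Random(0)
pe_counts = [0, 2000, 4000, 8000, 12000, 16000]
means, stds = [], []
for pe in pe_counts:
    true_mean = 30.0 + 0.002 * pe
    true_std = 10.0 + 0.0005 * pe
    samples = [rng.gauss(true_mean, true_std) for _ in range(20000)]
    mean, std = fit_gaussian(samples)
    means.append(mean)
    stds.append(std)

# One linear trend per distribution parameter, as in each subfigure.
mean_slope, mean_intercept = fit_line(pe_counts, means)
std_slope, std_intercept = fit_line(pe_counts, stds)
print(f"mean: {mean_slope:.5f} steps/cycle, intercept {mean_intercept:.2f}")
print(f"std:  {std_slope:.5f} steps/cycle, intercept {std_intercept:.2f}")
```

With real characterization data, `samples` would be replaced by the per-cell threshold voltages of one state at one P/E cycle count.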


¹¹ For the ER state, a linear fit has a 5.9% higher root mean square error than a power-law fit. However, we choose the linear
fit due to its simplicity.

Figure 18 shows how the RBER increases as the P/E cycle count increases. The top graph breaks
down the errors into which bit (i.e., LSB or MSB) they occur in. The bottom graph breaks down
the errors based on how the error changed the cell state due to a shift in the cell threshold voltage.
If the error caused either the LSB or MSB (but not both) to be read incorrectly, we refer to that
error as a single-bit error (ER ↔ P1, P1 ↔ P2, and P2 ↔ P3 in the graph). If both the LSB and MSB
are read incorrectly as a result of the error, we refer to that error as a multi-bit error. We make
four observations from Figure 18. First, both LSB and MSB errors increase as the P/E cycle count
increases, following an exponential trend. Second, ER ↔ P1 errors increase at a much faster rate
as the P/E cycle count increases, compared to the other types of cell state changes, and ER ↔ P1
errors become the dominant MSB error type when the P/E cycle count reaches 8K P/E cycles (6K is
the cross-over point). This is because the electrons trapped in the cell during wearout prevent the
cell from being set to very low threshold voltages. As a result, the threshold voltage distribution of
the ER state shifts and widens more than the distributions of the other states, as we see in Figure 17.
Third, multi-bit errors are less common, but they occur as early as 1K P/E cycles. Only a large
difference between the target and actual threshold voltage can lead to a multi-bit error, which is
unlikely to happen. Fourth, MSBs have a 2.1x higher error rate than LSBs, on average across all
P/E cycle counts. This is because the flash controller must use two read reference voltages to read a
cell’s MSB, but needs only one read reference voltage to read a cell’s LSB.

Fig. 18. RBER due to P/E cycling errors vs. P/E cycle count.
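
To make the single-bit versus multi-bit distinction concrete, the following sketch decodes a cell's threshold voltage into an MSB and an LSB and reports which bits differ between the programmed and observed states. The (MSB, LSB) encoding in `BITS` is an assumption, chosen only so that the LSB needs one read reference voltage (V_b) and the MSB needs two (V_a and V_c), as described above. The function names and the voltage values used in the example are hypothetical.

```python
# Assumed (MSB, LSB) encoding of each state, chosen so that the LSB needs one
# read reference voltage and the MSB needs two, as the text describes.
BITS = {"ER": (1, 1), "P1": (0, 1), "P2": (0, 0), "P3": (1, 0)}


def read_state(v_th, v_a, v_b, v_c):
    """Map a threshold voltage to a state using the three read references."""
    if v_th < v_a:
        return "ER"
    if v_th < v_b:
        return "P1"
    if v_th < v_c:
        return "P2"
    return "P3"


def read_lsb(v_th, v_b):
    """LSB read: a single comparison against V_b."""
    return 1 if v_th < v_b else 0


def read_msb(v_th, v_a, v_c):
    """MSB read: two comparisons, against V_a and V_c."""
    return 1 if (v_th < v_a or v_th >= v_c) else 0


def classify_error(programmed, observed):
    """Return the list of bits that differ between two cell states."""
    msb_p, lsb_p = BITS[programmed]
    msb_o, lsb_o = BITS[observed]
    flipped = []
    if msb_p != msb_o:
        flipped.append("MSB")
    if lsb_p != lsb_o:
        flipped.append("LSB")
    return flipped


# Hypothetical read reference voltages, in voltage steps.
V_A, V_B, V_C = 80, 150, 220
observed = read_state(95, V_A, V_B, V_C)       # a worn ER cell read as P1
print(observed, classify_error("ER", observed))
print(classify_error("P1", "P2"), classify_error("ER", "P2"))
```

Under this assumed encoding, ER ↔ P1 and P2 ↔ P3 errors flip only the MSB and P1 ↔ P2 errors flip only the LSB, while an ER ↔ P2 error flips both bits, which requires a much larger threshold voltage shift.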


Figure 19 shows how the optimal read reference voltages change as the P/E cycle count increases. 
This figure contains three subfigures, which show the optimal voltages for V_a, V_b, and
V_c, respectively (see Figure 1a). We make two observations from this figure. First, the optimal voltage for V_a
increases rapidly as the P/E cycle count increases: after 16K P/E cycles, the voltage goes up by more
than 20 voltage steps. Second, the optimal voltages for V_b and V_c remain almost constant as the P/E
cycle count increases: neither voltage changes by more than 4 voltage steps after 16K P/E cycles, as 
expected from the lack of change in P1, P2, and P3 distribution means shown in Figure 17. 

Fig. 19. Optimal read reference voltages vs. P/E cycle count.
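
One way to see why the optimal read reference voltage between the ER and P1 states rises with wearout is to model the two neighboring states as Gaussians and sweep the candidate voltages for the one that minimizes the expected number of misread cells. This is a simplified sketch, not the procedure used in our characterization. It assumes both states are equally likely, and the distribution parameters and function names are hypothetical.

```python
import math


def normal_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))


def misread_probability(v_ref, low, high):
    """Probability that a cell from either neighboring state is misread.

    low and high are (mean, std) pairs; both states are assumed equally likely.
    """
    low_above = 1.0 - normal_cdf(v_ref, *low)   # low-state cell read as high
    high_below = normal_cdf(v_ref, *high)       # high-state cell read as low
    return 0.5 * (low_above + high_below)


def optimal_read_voltage(low, high, candidates):
    """Pick the candidate voltage step with the fewest expected misreads."""
    return min(candidates, key=lambda v: misread_probability(v, low, high))


steps = range(0, 201)
p1 = (120.0, 10.0)            # hypothetical P1 distribution (mean, std)
fresh_er = (30.0, 12.0)       # hypothetical ER distribution at low wearout
worn_er = (60.0, 18.0)        # hypothetical ER distribution at high wearout

v_fresh = optimal_read_voltage(fresh_er, p1, steps)
v_worn = optimal_read_voltage(worn_er, p1, steps)
print(v_fresh, v_worn)
```

Because the ER distribution both shifts upward and widens in this example while P1 stays fixed, the minimizing voltage moves toward P1, which mirrors the trend observed for the lowest read reference voltage.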

Insights. To compare the error characteristics of 3D NAND flash memory to those of planar
NAND flash memory, we take the equivalent observations on planar NAND flash memory reported 
by prior works [14, 64, 81], and compare them to our findings for 3D NAND flash memory, which 
we just described. We find two key differences. First, for 3D NAND flash memory, the threshold 
voltage distributions for the P2 state and the P3 state shift to lower voltages as the P/E cycle count 
increases. In contrast, for planar NAND flash memory, the distributions of both states shift to higher 
voltages [14, 64, 81]. One possible source of this change is the increased impact of early retention 
loss with P/E cycle count, which lowers the threshold voltage of cells in higher-voltage states (i.e., 
P2 and P3) [23]. Second, for 3D NAND flash memory, the change in the mean threshold voltage 
of each state distribution exhibits a linear increase. However, in sub-20 nm planar NAND flash
memory, the change in the mean threshold voltage exhibits a power-law-based increase with P/E
cycle count [64, 81]. In sub-20 nm planar NAND flash memory, the mean threshold voltage of each
state distribution increases more rapidly at lower P/E cycle counts than at higher P/E cycle counts,
resulting in the power-law-based behavior. However, we note that planar NAND flash memory 
using an older manufacturing process technology (e.g., 20-24 nm) exhibits a linear increase with 
P/E cycle count for the distribution mean [14], just as we observe for 3D NAND flash memory. 
Thus, there is evidence that when the manufacturing process technology scales below a certain size, 
the change in the distribution mean transitions from linear behavior to power-law-based behavior 
with respect to P/E cycle count. As a result, when future 3D NAND flash memory scales down 
to a sub-20 nm manufacturing process technology node, we might expect that it too will exhibit 
power-law behavior for the change in the distribution mean. We conclude that the differences we 
observe between the P/E cycling effect in 3D NAND flash memory and planar NAND flash memory 
are mainly caused by the use of a significantly different manufacturing process technology node. 


A.1.3 Program Interference. When a cell (which we call the aggressor cell) is being programmed, 
cell-to-cell program interference can cause the threshold voltage of nearby flash cells (which we 
call victim cells) to increase unintentionally [15, 16] (see Section 2.2). In 3D NAND flash memory, 
there are two types of program interference that can occur. The first, wordline-to-wordline program 
interference, affects victim cells along the z-axis of the cell that is programmed (see Figure 3). These 
victim cells are physically next to the cell that is programmed, and belong to the same bitline (and 
thus the same flash block). The second, bitline-to-bitline program interference, affects victim cells 
along the x-axis or y-axis of the cell that is programmed. Bitline-to-bitline program interference 
can affect victim cells in the same wordline (i.e., cells on the y-axis), or it can affect victim cells that 
belong to other flash blocks (i.e., cells on the x-axis). 

To quantitatively analyze the effect of program interference on cell threshold voltage and
raw bit error rate, we use the same experimental data that we have for P/E cycling errors (see
Section A.1.2). A correlation exists between the amount by which program interference changes
the threshold voltage of a victim cell (ΔV_victim) and the threshold voltage change of the aggressor
cell (ΔV_aggressor) [15]. As a result of this interference correlation, the threshold voltage of a victim
cell is dependent on the threshold voltage of the aggressor cell. The strength of this correlation can
be quantified as ΔV_victim / ΔV_aggressor, which is a property of the NAND device and is largely dependent on
the distance between the cells [57]. After programming randomly-generated data to the victim cells
and the aggressor cells, we estimate ΔV_aggressor by calculating the threshold voltage difference
between the aggressor cell’s threshold voltage in its final state and that in the ER state. We estimate
ΔV_victim by calculating the difference between the victim cell’s threshold voltage with and without
program interference.¹²


¹² The cell threshold voltage without program interference is obtained by reading the cell before the next wordline is
programmed. 
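
As an illustration of how the interference correlation could be estimated from paired measurements, the sketch below fits the ratio between the victim shift and the aggressor shift using a least-squares slope through the origin. The choice of estimator is an assumption made for this example and is not necessarily the one used in our analysis. The function name, the synthetic data, the noise level, and the aggressor shift values are all hypothetical.

```python
import random


def interference_correlation(dv_aggressor, dv_victim):
    """Least-squares slope through the origin: dv_victim ~ c * dv_aggressor."""
    num = sum(a * v for a, v in zip(dv_aggressor, dv_victim))
    den = sum(a * a for a in dv_aggressor)
    return num / den


# Synthetic (hypothetical) measurements, in voltage steps.
rng = random.Random(1)
TRUE_CORRELATION = 0.027                      # value used to generate the data
AGGRESSOR_SHIFTS = [0.0, 80.0, 160.0, 240.0]  # aggressor stays in ER or moves up

dv_aggressor = [rng.choice(AGGRESSOR_SHIFTS) for _ in range(20000)]
# Victim shift = correlation * aggressor shift, plus measurement noise.
dv_victim = [TRUE_CORRELATION * a + rng.gauss(0.0, 1.0) for a in dv_aggressor]

estimate = interference_correlation(dv_aggressor, dv_victim)
print(f"estimated interference correlation: {estimate:.4f}")
```

Forcing the fit through the origin encodes the assumption that an aggressor cell whose threshold voltage does not change induces no shift in the victim cell.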

Observations. Figure 20 shows the interference correlation for wordline-to-wordline interference
and bitline-to-bitline interference on a victim cell, for aggressor cells of varying distance from
the victim cell. For example, the victim cell in BL M, WL N has an interference correlation of 2.7%
with the next wordline aggressor cell in BL M, WL N+1, which means that, if the threshold voltage
of the aggressor cell increases by ΔV, the threshold voltage of the victim cell increases by 0.027ΔV
due to wordline-to-wordline program interference. We make two observations from this figure.
First, the interference correlation of the next wordline aggressor cell (i.e., 2.7%) is over an order
of magnitude higher than that of any other aggressor cell, of which the maximum interference
correlation is only 0.080% (the previous wordline aggressor cell in BL M, WL N-1). Thus, the program
interference to the victim cell is dominated by wordline-to-wordline interference from the next
wordline. Second, all of the other types of interference have much smaller interference correlation
values.


[Figure 20: diagram of the victim cell and its neighboring aggressor cells; recoverable labels: Previous wordline, Next bitline, Next wordline+bitline.]


Fig. 20. Interference correlation for a victim cell, as a result of programming aggressor cells of varying 
distances from the victim cell. 


Figure 21 shows how much the threshold voltage of a victim cell shifts (ΔV_victim) when a
neighboring aggressor cell is programmed to the P3 state, which generates the largest possible 
program interference. Each curve represents a certain program interference type (i.e., Next WL or 
Prev WL) and a certain state of the victim cell (V). The curves that have a significant amount of 
threshold voltage shift (e.g., >6 voltage steps) due to program interference are shown in Figure 21(a); 
the curves that have a small amount of threshold voltage shift are shown in Figure 21(b). We make 
three observations from Figure 21. First, the effect of program interference decreases as the P/E 
cycle count increases (along the x-axis, from left to right). As we discuss in Section A.1.2, electrons 
trapped in a flash cell due to wearout prevent the cell from returning to the lowest threshold 
voltage values during an erase operation. As a result, as the P/E cycle count increases, the mean 
threshold voltage of the ER state increases. This causes ΔV_aggressor to decrease as the P/E cycle
count increases, because the starting voltage of the aggressor cell increases but its target voltage
after programming remains the same. As we discuss above, the interference correlation (i.e., the
ratio of ΔV_victim to ΔV_aggressor) is largely a function of the distance between flash cells.
Thus, since ΔV_aggressor decreases, ΔV_victim also decreases with the P/E cycle count. Second, the
amount of program interference induced by an aggressor cell in the next wordline decreases when
the victim cell is in a higher-voltage state (Next WL curves in Figure 21a, from top to bottom). This 
is likely because the voltage difference between the aggressor cell and the victim cell is lower when 
the victim cell is in a higher-voltage state, reducing the threshold voltage shift due to program
interference. Third, the program interference induced by an aggressor cell in the previous wordline 
(Prev WL curves in Figure 21) affects the threshold voltage distribution of only the ER state for a 
victim cell, but it has little effect on the distributions of the other three states (i.e., P1, P2, P3). This 
is a result of how programming takes place in NAND flash memory. A program operation can only
increase the voltage of a cell due to circuit-level limitations. When the aggressor cell in the previous 
wordline is programmed, the victim cell is already in the ER state, and the victim cell’s voltage 
increases due to program interference. Some time later, the victim cell is programmed. If the target 
state of the victim cell is P1, P2, or P3, the programming operation needs to further increase the 
voltage of the cell, and any effects of program interference from the aggressor cell in the previous 
wordline are eliminated. If, however, the target state of the victim cell is ER, the programming 
operation does not change the victim cell’s voltage, and the effects of program interference from 
the aggressor cell in the previous wordline remain. 

Fig. 21. Amount of threshold voltage shift due to program interference vs. P/E cycle count.
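
The reasoning behind the first observation for Figure 21 can be written as a short calculation: if the interference correlation is fixed by cell geometry and the mean ER voltage rises with wearout, then the aggressor's voltage change, and hence the victim's shift, shrinks as the P/E cycle count grows. The sketch below assumes a linear ER-mean model with hypothetical coefficients and a hypothetical P3 target voltage. Only the 2.7% next-wordline correlation comes from Figure 20.

```python
NEXT_WL_CORRELATION = 0.027  # next-wordline interference correlation (Figure 20)


def er_mean(pe_cycles, intercept=30.0, slope=0.002):
    """Hypothetical linear model of the ER-state mean, in voltage steps."""
    return intercept + slope * pe_cycles


def victim_shift(pe_cycles, target_voltage=230.0):
    """Expected victim shift when the aggressor moves from ER to its target.

    target_voltage is a hypothetical P3 target that does not change with wear.
    """
    dv_aggressor = target_voltage - er_mean(pe_cycles)
    return NEXT_WL_CORRELATION * dv_aggressor


for pe in (0, 4000, 8000, 16000):
    print(pe, round(victim_shift(pe), 2))
```

The absolute values printed here depend entirely on the assumed constants; only the downward trend with P/E cycle count is meant to match the observation.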


Insights. We compare the program interference in 3D NAND flash memory to the program 
interference observed in planar NAND flash memory, as reported in prior work [15, 16]. We 
find one major difference. The maximum interference correlation of program interference from a 
directly-adjacent cell is 40% lower in 3D NAND flash memory (2.7%) than in state-of-the-art
(20-24 nm) planar NAND flash memory (4.5% [15]). This is corroborated by findings in prior work [80],
which shows that 3D NAND flash memory has 84% lower program interference than 15-19 nm 
planar NAND flash memory. The lower interference correlation in 3D NAND flash memory is 
due to the larger manufacturing process technology node (30-50 nm for the chips we test) that 
it uses compared to state-of-the-art planar NAND flash memory. The amount of interference 
correlation between neighboring cells is a function of the distance between the cells [57]. In a larger 
manufacturing process technology node, the flash cells are farther away from each other, causing 
the interference correlation to decrease. We note that when future 3D NAND flash memory chips 
use smaller manufacturing process technology nodes, the impact of programming interference will 
increase, similar to what happened in planar NAND flash memory. 

Note that we are the first to compare how the threshold voltage shift caused by program 
interference changes with the P/E cycle count. As we discuss in our first observation for Figure 21, 
the program interference effect decreases as the P/E cycle count increases because the increasing 
effects of wearout reduce the value of ΔV_aggressor during programming. We conclude that the 40%
reduction in the program interference effect we observe in 3D NAND flash memory compared 
to planar NAND flash memory is mainly caused by the difference in manufacturing process 
technology. 


A.2 Early Retention Loss 


In this section, we present the results and analysis of retention loss in 3D NAND flash memory in 
addition to the key findings in Section 4.3. We use the same methodology as described in Section 4.3. 

Observations. Figure 22 shows how the mean and the standard deviation of the threshold 
voltage distribution change with retention time. Each subfigure in the top row shows the mean 
for a different state; each subfigure in the bottom row shows the standard deviation for a different
state. The blue dots show the measured data; each orange line shows a linear trend line fitted to 
the measured data. The x-axis shows the retention time in log scale; the y-axis shows the mean 
or standard deviation value in voltage steps. We make five observations from this figure. First, 
the threshold voltage distribution shifts more when the retention time is low. This is the early 
retention loss phenomenon, which occurs because charge that is trapped near the surface of the 
charge trap layer is detrapped soon after programming. Second, as the retention time increases, 
the voltage values of cells in the P1, P2, and P3 states decrease, while the voltage values of cells 
in the ER state increase. This is because the cells in the ER state have negative threshold voltages, 
and hence they gain charge over retention time. Third, the threshold voltage distributions of the 
ER and P3 states shift faster than the distributions of the P1 and P2 states as the retention time 
increases. This is because the ER and P3 states have larger voltage differences from the ground 
than the other states. Fourth, retention loss has little effect on the width of the threshold voltage 
distribution (i.e., standard deviation values change by less than 1 voltage step after 24 days). This is 
because the effects of retention loss (i.e., charge leakage) impact cells at a similar rate, causing all 
of the cells within the threshold voltage distribution to lose a similar amount of voltage. Fifth, the 
correlation between any distribution parameter (V) and the retention time (t) can be modeled as a
linear function of log(t) (shown by the dotted lines in Figure 22): V = A · log(t) + B. A and B are constants
that change based on which parameter V is modeling (i.e., the threshold voltage distribution mean 
or standard deviation). Prior work shows that planar NAND flash memory has a similar trend for 
retention loss, even though it uses a different flash cell design. We have already compared and 
evaluated the differences between 3D NAND and planar NAND flash memory in retention loss 
speed in Section 4.3, and provided more detail about the linear function that models the threshold 
voltage distribution parameters in Section 5.2. 
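
As a minimal sketch of how the constants A and B in V = A · log(t) + B could be obtained, the following code performs a least-squares fit on log(t). It assumes the natural logarithm (a different base only rescales A), and it uses hypothetical retention times and a hypothetical, noise-free P3-state mean generated from made-up constants. None of these values are measurements.

```python
import math


def fit_log_model(times, values):
    """Least-squares fit of V = A * log(t) + B; returns (A, B)."""
    xs = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_v = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxv = sum((x - mean_x) * (v - mean_v) for x, v in zip(xs, values))
    a = sxv / sxx
    return a, mean_v - a * mean_x


def predict(a, b, t):
    """Evaluate the fitted model at retention time t."""
    return a * math.log(t) + b


# Hypothetical retention times (hours) and a hypothetical P3-state mean.
TRUE_A, TRUE_B = -1.5, 220.0
times = [0.2, 1, 5, 24, 120, 576]
values = [predict(TRUE_A, TRUE_B, t) for t in times]

a, b = fit_log_model(times, values)
first_day = predict(a, b, 1) - predict(a, b, 24)     # shift from 1 h to 24 h
second_day = predict(a, b, 24) - predict(a, b, 48)   # shift from 24 h to 48 h
print(f"A = {a:.3f}, B = {b:.3f}")
print(f"shift in first day: {first_day:.2f}, in second day: {second_day:.2f}")
```

Because the model is linear in log(t), the predicted shift during the first day is several times larger than during the second day, which is the signature of early retention loss.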

Figure 23 shows how the RBER increases with retention time for a block that has endured 10K 
P/E cycles. The top graph breaks down the errors according to the change in cell state as a result of 
the errors; the bottom graph breaks down the errors into MSB and LSB page errors. We make two 
observations from Figure 23, in addition to our observations in Section 4.3. First, retention errors 
are dominated by P2 ↔ P3 errors, because the threshold voltage distribution of the P3 state not
only shifts more but also widens more with retention time than the distributions of the other states
(see Figure 22). Although the distribution of the ER state also shifts significantly, there are fewer
ER ↔ P1 errors to begin with. Second, the MSB error rate increases faster than the LSB error rate
as the retention time increases. This is because as the distributions of both the ER and P3 states
shift more than those of the P1 and P2 states, cells in the ER and P3 states are more likely to have
errors. These errors (ER ↔ P1 and P2 ↔ P3) affect the MSB of the cell.

Fig. 22. Mean and standard deviation of our Gaussian threshold voltage distribution model of each state,
versus retention time.

Fig. 23. RBER vs. retention time, broken down by (a) the state transition of each flash cell, and (b) MSB or
LSB page.

Insights. We compare the errors due to retention loss in 3D NAND flash memory to those in
planar NAND flash memory, as reported in prior work [6, 7, 69]. We find another major difference
in 3D NAND flash memory in terms of threshold voltage distribution, in addition to those discussed
in Section 4.3. We find that the retention loss phenomenon we observe in 3D NAND flash memory
(1) shifts the threshold voltage distributions of the P1, P2 and P3 states lower, and (2) has little effect
on the width of the distribution of each state. In contrast, the retention loss phenomenon observed
in planar NAND flash memory (1) does not shift the P1 and P2 state distributions by much, and
(2) increases the width of each state’s distribution significantly [6]. This indicates that a mechanism
that adjusts the optimal read reference voltage to the threshold voltage shift caused by retention
loss can be more effective on 3D NAND flash memory than on planar NAND flash memory, because
the distributions shift by a greater amount (indicating a greater need for voltage adjustment) with
a smaller amount of overlap between two threshold voltage distributions (reducing the number of
read errors when the optimal read reference voltage is used). We conclude that, due to the early
retention loss phenomenon we observe in 3D NAND flash memory, the threshold voltage of a flash
cell changes quickly within several hours after programming, leading to significant changes in
RBER and optimal read reference voltage values.


A.3 Read-Induced Errors 


In this section, we analyze how each type of read-induced error affects the RBER and the threshold 
voltage distribution of 3D NAND flash memory. 


A.3.1 Read Errors. A read error is a type of read-induced error where two reads to a flash 
cell may return different data values if the read reference voltage used to read the cell is close 
to the cell’s threshold voltage [24, 29, 42] (see Section 2.2). A read error adds uncertainty to the 
outcome of every read operation performed by the SSD controller. However, despite the potential 
for widespread impact, read errors are not well-studied by prior work. 

To quantify read errors, we use the data we collected in Section 4.3. For each cell, we see if the 
actual read outcome (i.e., the bit value output by the flash controller after a read operation) matches 
the expected read outcome (i.e., the value that the read should have returned based on the current 
voltage of the flash cell). We determine the expected read outcome by comparing V_ref with V_th (i.e.,
we expect to read 1 if V_th < V_ref, because V_ref is high enough that it should turn on the cell). We
obtain V_th by combining the outcomes of multiple reads when sweeping the read reference voltage;
thus, we expect that the combined output eliminates the impact of read errors and is accurate.
We say that a read error occurs if the actual read outcome and the expected read outcome do not 
match. 
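
The comparison between actual and expected read outcomes can be sketched as follows. The code takes the threshold voltage of each cell as already known, counts a read error whenever the returned bit differs from the expected bit, and groups the error rate by read offset (the read reference voltage minus the threshold voltage). The function names and the sample reads are hypothetical, and the sketch does not reproduce how the threshold voltage is obtained from the read reference voltage sweep.

```python
from collections import defaultdict


def expected_read(v_th, v_ref):
    """A cell is expected to turn on (read as 1) when V_th < V_ref."""
    return 1 if v_th < v_ref else 0


def error_rate_by_offset(reads):
    """Read error rate for each read offset (V_ref - V_th).

    reads is an iterable of (v_th, v_ref, actual_bit) tuples.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for v_th, v_ref, bit in reads:
        offset = v_ref - v_th
        totals[offset] += 1
        if bit != expected_read(v_th, v_ref):
            errors[offset] += 1
    return {off: errors[off] / totals[off] for off in sorted(totals)}


# Hypothetical reads: (threshold voltage, read reference voltage, bit returned).
reads = [
    (100, 101, 1),   # expected 1, read 1
    (100, 101, 0),   # expected 1, read 0: a read error close to V_th
    (100, 110, 1),   # expected 1, read 1
    (100, 95, 0),    # expected 0, read 0
]
rates = error_rate_by_offset(reads)
print(rates)
```

Grouping by offset in this way produces the kind of breakdown needed to study how the read error rate depends on the distance between the read reference voltage and the threshold voltage.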

Observations. Figure 24 shows how the read error rate changes as a function of the read offset 
(i.e., V_ref - V_th). We observe that, as the absolute value of the read offset increases, the read error
rate decreases exponentially. This is likely because when V_ref is closer to V_th (i.e., when V_ref - V_th
has a smaller absolute value), the amount of noise (i.e., voltage fluctuations) in the sense amplifier 
increases exponentially [24, 29]. The larger amount of noise increases the likelihood that the sense 
amplifier incorrectly detects whether the cell turns on, which leads to a larger probability that a 
read error occurs. 

Fig. 24. Read error rate vs. read offset (V_ref - V_th).


Figure 25 shows the correlation between the read error rate and the total RBER in a flash page. 
We observe that the read error rate is linearly correlated with the overall RBER. This is because, 
when the RBER is high, the threshold voltage distributions of neighboring states overlap with each 
other by a greater amount. This causes a larger number of cells to be close to the read reference 
voltage value, increasing the probability that a read error occurs (see Figure 24). 

Fig. 25. Relationship between the read error rate and the RBER.


Insights. We are the first to discover and quantify the extent of read errors, and to show the
correlation of these errors with the RBER and with the read reference voltage. We conclude that
read errors are correlated with the read offset (i.e., V_ref - V_th) and the overall RBER of the flash
page.


A.3.2 Read Disturb Errors. Read disturb errors occur when a read operation to one page in a
flash block introduces errors in other, unread pages in the same block [5, 76] (see Section 2.2).
Read disturb errors are caused by the high pass-through voltage applied to cells in the unread 
pages. 

To characterize read disturb errors, we first randomly select 11 flash blocks and wear out each 
block to 10K P/E cycles by repeatedly erasing and programming pseudorandomly generated data 
into each page of each block. Then, we program pseudorandomly generated data to each page
of each flash block. To minimize the impact of other errors, especially retention errors due to
early retention loss, we wait until the data has a 2-day retention time before inducing read disturb.
According to our results in Section 4.3, retention loss has slowed down after 2 days, so it can
shift the threshold voltage by at most 1 voltage step during the relatively short characterization
process (~9 h). To induce read disturb in the flash block, we repeatedly read from a wordline
within the block up to 900K times (i.e., up to 900K read disturbs). During this process, to
characterize the read disturb effect, we obtain the RBER and threshold voltage distribution at ten
different read disturb counts from 0 to 900K.
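
The measurement loop described above can be sketched as follows. This is a simplified illustration, not our test infrastructure: `read_wordline` and `measure` are hypothetical callbacks, `ToyBlock` is a made-up stand-in for a flash block, and the even spacing of the ten checkpoints is an assumption (the text only states that there are ten counts between 0 and 900K).

```python
def checkpoints(max_reads=900_000, num_points=10):
    """Read disturb counts at which to measure (even spacing is an assumption)."""
    step = max_reads // (num_points - 1)
    return [i * step for i in range(num_points)]

def characterize(read_wordline, measure, max_reads=900_000, num_points=10):
    """Issue reads in batches and record one measurement per checkpoint.

    read_wordline(n): hypothetical callback that reads one wordline n times.
    measure():        hypothetical callback returning, e.g., the block's RBER.
    """
    results = []
    reads_done = 0
    for target in checkpoints(max_reads, num_points):
        read_wordline(target - reads_done)  # bring the block up to `target` reads
        reads_done = target
        results.append((target, measure()))
    return results

class ToyBlock:
    """Made-up model whose RBER grows linearly with the number of reads."""
    def __init__(self):
        self.reads = 0
    def read(self, n):
        self.reads += n
    def rber(self):
        return 1e-4 + 1e-10 * self.reads

block = ToyBlock()
results = characterize(block.read, block.rber)
```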

Observations. Figure 26 shows how the mean and standard deviation of the threshold voltage 
distribution change with read disturb count. Each subfigure in the top row shows the mean for 
a different state; each subfigure in the bottom row shows the standard deviation for a different 
state. The blue dots show the measured data; each orange line shows a linear trend line fitted
to the measured data. The x-axis shows the read disturb count; the y-axis shows the distribution
parameters in voltage steps. We make three observations from this figure. First, the read disturb
effect increases the mean threshold voltage of the ER state significantly, by ~8 voltage steps after 
900K read disturbs. In contrast, the mean threshold voltages of the programmed states change 
by only a small amount (<3 voltage steps). The increase in the mean threshold voltage is lower
for a higher Vth state. This is because the impact of read disturb is correlated with the difference
between the pass-through voltage (see Section 2.1) and the threshold voltage of a cell. When the 
difference is larger (i.e., when the threshold voltage of a cell is lower), the impact of read disturb 
increases. In fact, we observe that the threshold voltage distribution of the P3 state even shifts to 
slightly lower voltage values during the experiment, because read disturb has little effect on cells in
the P3 state, and the impact of retention loss dominates. Second, the distribution width of each
state (i.e., standard deviation) decreases slightly as the read disturb count increases, by <0.2 voltage 
steps after 900K read disturbs. Third, the change in each distribution parameter can be modeled as 
a linear function of the read disturb count (as shown by the orange dotted lines). This shows that
read disturb in 3D NAND flash memory follows a linear trend similar to that observed in planar
NAND flash memory by prior work [5].

Fig. 26. Mean and standard deviation of threshold voltage distribution of each state, vs. read disturb count.
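
As a worked example of the linear model in the third observation, the sketch below interpolates the ER-state mean shift from the single endpoint reported above (about 8 voltage steps after 900K read disturbs). It assumes the trend line passes through zero shift at zero reads; `predicted_shift` is a hypothetical helper, not a fitted model from our data.

```python
def predicted_shift(read_disturb_count, total_shift=8.0, total_reads=900_000):
    """Linear estimate of the ER-state mean shift, in voltage steps.

    total_shift and total_reads come from the approximate endpoint in the text;
    a zero intercept is an assumption.
    """
    return total_shift * read_disturb_count / total_reads

shift_at_450k = predicted_shift(450_000)   # halfway point of the experiment
```

Under these assumptions, the slope is about 8.9e-6 voltage steps per read, so the shift halfway through the experiment is about 4 voltage steps.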


Figure 27 plots how RBER increases with read disturb count for a flash block that has endured 
10K P/E cycles. The top graph breaks down the errors according to the change in cell state as a 
result of the errors; the bottom graph breaks down the errors into MSB and LSB errors. We make 
three observations from Figure 27. First, ER↔P1 errors increase significantly with read disturb
count, whereas P1↔P2 and P2↔P3 errors do not. This is because the ER state threshold voltage
distribution shifts significantly with read disturb count (see Figure 26), reducing the threshold
voltage difference between the ER and P1 states. Second, MSB errors increase much faster than LSB
errors with read disturb count because ER↔P1 errors are a type of MSB error, and they increase
significantly with read disturb count. Third, the increase in RBER with read disturb count follows a 
linear trend (as shown by the dotted line in Figure 27b), which is similar to the observation made 
for planar NAND flash memory by prior work [5]. 
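
The two breakdowns used in Figure 27 can be illustrated with a short sketch that classifies each erroneous cell by its state transition and by which bit is flipped. The bit encoding below is an assumption: it is a common MLC Gray coding, written as (MSB, LSB), chosen so that an ER↔P1 error flips only the MSB, consistent with the text. The helper `classify_errors` and the sample cell states are hypothetical.

```python
from collections import Counter

# Assumed Gray coding as (MSB, LSB); the coding of the tested chips is not given here.
STATE_BITS = {"ER": (1, 1), "P1": (0, 1), "P2": (0, 0), "P3": (1, 0)}
ORDER = ["ER", "P1", "P2", "P3"]

def classify_errors(programmed, read_back):
    """Count errors by state transition and by affected bit (MSB or LSB)."""
    transitions = Counter()
    bit_errors = Counter()
    for p, r in zip(programmed, read_back):
        if p == r:
            continue  # cell was read correctly
        low, high = sorted((p, r), key=ORDER.index)
        transitions[f"{low}<->{high}"] += 1
        bit_errors["MSB"] += STATE_BITS[p][0] != STATE_BITS[r][0]
        bit_errors["LSB"] += STATE_BITS[p][1] != STATE_BITS[r][1]
    return transitions, bit_errors

# Made-up example: six cells, four of which are misread.
programmed = ["ER", "ER", "P1", "P2", "P3", "ER"]
read_back = ["P1", "ER", "P2", "P2", "P2", "P1"]
transitions, bit_errors = classify_errors(programmed, read_back)
```

In this toy input, the two ER↔P1 errors and the one P2↔P3 error are counted as MSB errors, while the P1↔P2 error is counted as an LSB error.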

Figure 28 shows how the optimal read reference voltages change with read disturb count. The 
three subfigures show the optimal voltages for Va, Vb, and Vc. We make two observations from this
figure. First, the optimal voltages for Vb and Vc change by relatively little as the read disturb count
increases (<3 voltage steps after 900K read disturbs), whereas the optimal Va changes more with
the read disturb count. This is because read disturb causes the threshold voltage distributions of
lower-voltage states to change by a greater amount, which requires the read reference voltages
separating the lower-voltage states (e.g., Va) to change more. Second, the increase in the optimal Va
follows a linear trend with read disturb count, because the ER state threshold voltage distribution
shifts linearly (as we see from Figure 26). 
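
To illustrate why a shift in the ER state moves the optimal Va, the sketch below models two neighboring states as Gaussian distributions and sweeps candidate read reference voltages to find the one that minimizes the misread probability. All distribution parameters are hypothetical, both states are assumed equally likely, and `optimal_vref` is an illustrative helper rather than the procedure used in our characterization.

```python
import math

def gaussian_cdf(x, mean, std):
    """Cumulative distribution function of a Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def misread_probability(v_ref, low, high):
    """Probability of misreading a cell; low and high are (mean, std) pairs."""
    p_low_above = 1.0 - gaussian_cdf(v_ref, *low)   # lower state read as higher
    p_high_below = gaussian_cdf(v_ref, *high)       # higher state read as lower
    return 0.5 * (p_low_above + p_high_below)

def optimal_vref(low, high):
    """Sweep integer voltage steps between the two means; return the best one."""
    candidates = range(int(low[0]), int(high[0]) + 1)
    return min(candidates, key=lambda v: misread_probability(v, low, high))

# Hypothetical (mean, std) values in voltage steps.
er_before = (0.0, 10.0)
er_after = (8.0, 10.0)    # ER mean shifted up by 8 steps due to read disturb
p1 = (60.0, 10.0)

va_before = optimal_vref(er_before, p1)
va_after = optimal_vref(er_after, p1)
```

With equal distribution widths, the optimum sits midway between the two means, so an 8-step shift of only the ER state moves the optimal Va by 4 steps in this toy model. Real distributions have unequal widths, so the measured shift can differ.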

Fig. 27. RBER vs. read disturb count, broken down by (a) the state transition of each flash cell, and (b) MSB
or LSB page.


Fig. 28. Optimal read reference voltages vs. read disturb count.


Insights. We compare the read disturb effect that we observe in 3D NAND flash memory to that
observed in planar NAND flash memory by prior work [5]. We observe that, although
RBER increases linearly with read disturb count in both 3D NAND and planar NAND flash memory,
the slope of the increase (i.e., the sensitivity of the RBER to read disturb) at 10K P/E cycles is 96.7%
lower in 3D NAND flash memory than in planar NAND flash memory [5]. We believe that this
difference in the sensitivity to the read disturb effect is due to the use of a larger process technology
node (30-40 nm) in current 3D NAND flash memory. The comparable planar NAND flash memory
results from prior work are collected on 20-24 nm planar NAND flash memory devices [5]. We
expect the read disturb effect in 3D NAND flash memory to increase in the future as the process
technology node size shrinks. We conclude that the 96.7% reduction in the read disturb effect we
observe in 3D NAND flash memory compared to planar NAND flash memory is mainly caused
by the difference in manufacturing process technology nodes of the two types of NAND flash
memories.
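
To put the 96.7% figure in perspective, the short calculation below converts the fractional reduction in slope into a ratio between the two slopes. `sensitivity_ratio` is a hypothetical helper; the only input is the reduction reported above.

```python
def sensitivity_ratio(reduction):
    """How many times larger the planar slope is, given the fractional reduction."""
    return 1.0 / (1.0 - reduction)

ratio = sensitivity_ratio(0.967)   # planar slope divided by 3D slope
```

A 96.7% lower slope means the 3D NAND slope is 3.3% of the planar slope, i.e., the planar NAND RBER grows roughly 30 times faster per read disturb at 10K P/E cycles.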


A.4 Layer-to-Layer Process Variation 


In this section, we present new results and analyses of the layer-to-layer process variation phenomenon
in 3D NAND flash memory, in addition to the key findings we already presented in Section 4.2.
We use the same methodology as we describe in Section 4.2.

Figure 29 shows how the threshold voltage distribution mean and standard deviation of each
state change with layer number, for a flash block that has endured 10K P/E cycles. Each subfigure
in the top row shows the mean for a different state; each subfigure in the bottom row shows the 
standard deviation for a different state. We make two observations from this figure. First, the ER 
state threshold voltage increases by as much as 25 voltage steps as the layer number changes, while 
the mean threshold voltages of the other three states do not vary by much. This is because the 
threshold voltage of a cell in ER state is set after an erase operation, and the value it is set to is 
a function of manufacturing process variation and of wearout. In contrast, the threshold voltage 
of a cell in one of the other states (P1, P2, or P3) is set to a fixed target voltage value regardless of 
process variation [3, 69, 89, 91] (see Section 2.1). Since only the voltage of the ER state is affected by 
layer-to-layer process variation, only one of the read reference voltages, Va, changes with the layer
number, as we already observed in Figure 6. Second, the distribution widths of the ER and P1 states
(i.e., their standard deviations) increase in the top layers, and decrease in the bottom layers. This 
pattern is similar to the pattern of how the RBER changes with layer number, which we show in 
Figure 5 (Section 4.2). A wider threshold voltage distribution increases the overlap of neighboring 
distributions, leading to more errors in the top layers. However, the distribution widths of the P2 and
P3 states mainly decrease as the layer number increases. Unfortunately, we are unable to completely
explain why the mean threshold voltage and distribution width change differently with layer number
for different states because we do not have exact circuit-level information about layer-to-layer 
process variation. 

Fig. 29. Mean and standard deviation of our Gaussian threshold voltage distribution model of each state,
versus layer number.


We conclude that layer-to-layer process variation significantly impacts the threshold voltage 
distribution and leads to large variations in RBER and optimal read reference voltages across layers. 
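
The per-layer view used in this section can be sketched by grouping the threshold voltages of cells in one state by layer and estimating a mean and standard deviation for each group. This is a simplified illustration: the sample moments below stand in for the parameters of a Gaussian model, `per_layer_stats` is a hypothetical helper, and the sample values are made up.

```python
from collections import defaultdict
from statistics import mean, pstdev

def per_layer_stats(samples):
    """samples: iterable of (layer, threshold_voltage) for cells in one state.

    Returns {layer: (mean, standard deviation)} using sample moments.
    """
    by_layer = defaultdict(list)
    for layer, vth in samples:
        by_layer[layer].append(vth)
    return {layer: (mean(v), pstdev(v)) for layer, v in sorted(by_layer.items())}

# Made-up ER-state samples for two layers, in voltage steps.
samples = [(0, 10), (0, 12), (0, 14), (1, 30), (1, 34), (1, 38)]
stats = per_layer_stats(samples)
means = [m for m, _ in stats.values()]
spread = max(means) - min(means)   # variation of the mean across layers
```

A large `spread` for a state corresponds to the kind of layer-dependent mean shift reported above for the ER state.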


A.5 Bitline-to-Bitline Process Variation 

We perform an analysis of the variation of RBER and threshold voltage distribution along the y-axis 
(i.e., across groups of bitlines) for a flash block that has endured 10K P/E cycles. We use a similar 
methodology to our layer-to-layer process variation experiments (see Section 4.2).

Figure 30 shows how the threshold voltage distribution mean and standard deviation of each
state change with bitline number, for a flash block that has endured 10K P/E cycles. Each subfigure
in the top row shows the mean for a different state; each subfigure in the bottom row shows the 
standard deviation for a different state. Note that we normalize the bitline number to a range of 0 to 100,
by multiplying the actual bitline number by a constant, to maintain the anonymity of the chip
vendors. We make two observations from this figure. First, the variations in mean threshold voltage
and the distribution width (i.e., standard deviation) are much smaller in this figure compared to
those observed in Figure 29 for layer-to-layer variation (Appendix A.4). This indicates that
bitline-to-bitline process variation is much smaller than layer-to-layer process variation in 3D
NAND flash memory. Second, we observe that the pattern of the mean threshold voltage repeats
periodically, once every 25 normalized bitlines. We believe that this indicates a repetitive architecture in the way
that the 3D NAND flash memory chip is organized (for example, each block may be made up of 
four arrays of flash cells that are connected together). Unfortunately, we cannot completely explain 
this behavior without access to circuit-level design information that is proprietary to NAND flash 
memory vendors. 

Fig. 30. Mean and standard deviation of our Gaussian threshold voltage distribution model of each state,
versus bitline number.
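
The normalization and the periodicity observation can be illustrated as follows. The sketch is not our analysis code: `normalize_bitline` and `best_period` are hypothetical helpers, the bitline count is made up, and the input is a synthetic sawtooth with a period of 25 rather than measured data.

```python
def normalize_bitline(index, num_bitlines, scale=100.0):
    """Map a raw bitline index onto [0, scale] by multiplying by a constant."""
    return index * scale / (num_bitlines - 1)

def best_period(values, candidates):
    """Return the candidate lag with the smallest mean squared self-difference."""
    def mismatch(lag):
        pairs = list(zip(values, values[lag:]))
        return sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return min(candidates, key=mismatch)

# Synthetic per-bitline mean threshold voltage that repeats every 25 positions.
values = [(i % 25) * 0.1 for i in range(100)]
period = best_period(values, range(5, 50))

first = normalize_bitline(0, 1001)      # hypothetical block with 1001 bitline groups
last = normalize_bitline(1000, 1001)
```

A period of 25 on a 0 to 100 axis corresponds to four repetitions across the block, which matches the four-array organization suggested above as one possible explanation.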


Figures 31 and 32 show how the RBER and optimal read reference voltages change with bitline 
number, for a flash block that has endured 10K P/E cycles. We observe that neither RBER nor the 
optimal read reference voltages change by much across bitlines. This indicates that the changes 
that we observe in Figure 30 may not be significant enough to lead to variation in the reliability 
of different bitlines. We conclude that bitline-to-bitline process variation is much smaller than 
layer-to-layer process variation in 3D NAND flash memory. 

Fig. 32. Optimal read reference voltages vs. bitline number.


